GENERATION AND SELECTION OF NOVEL DNA-BINDING
PROTEINS AND POLYPEPTIDES

BACKGROUND OF THE INVENTION
Field of the Invention

This invention relates to development of novel DNA-binding
proteins and polypeptides by an iterative process of mutation,
expression, selection, and amplification. The ability to create
novel DNA-binding proteins will have far-reaching applications,
including, but not limited to, use in: a) treating viral
diseases, b) treating genetic diseases, c) preparation of novel
biochemical reagents, and d) biotechnology to regulate gene
expression in cell cultures. Several workers have shown that
repressors derived from bacteria function when expressed in
eukaryotic cells (BREN84, FIGG88, BROW87, HUMC87, HUMC88), but
none have shown how to generate proteins that bind
sequence-specifically to a predetermined DNA sequence. For
reviews of transcriptional control in eukaryotic cells, see
STRU87, JONE87, and MANI87. The present application deals only
with sequence-specific DNA-binding proteins, abbreviated DBP.

Proteins, particularly repressors, having affinity for specific
sites on DNA modulate transcription of genes. The best known are
a group of proteins primarily studied in prokaryotes that
contain the structural motif alpha-helix-turn-alpha-helix
(H-T-H) (SAUE82, PABO84). These proteins bind as dimers or
tetramers to DNA at specific operator sequences that have
approximately palindromic sequences. Contacts made by two
adjacent alpha helices of each monomer in and around two sites
in the major groove of B-form DNA are a major feature in the
DNA-protein interface. This group of proteins includes phage
repressor and Cro proteins, bacterial metabolic repressors such
as GalR, LacI, LexA, and TrpR, bacterial activator protein CAP
and activator/repressor AraC, bacterial transposon and plasmid
TetR proteins (PABO84), the yeast mating type regulators MATa1
and MATalpha2 (MILL85), and eukaryotic homeo box proteins
(EVAN88).



Interactions between dimeric repressors and approximately
palindromic operators have usually been discussed in the
literature with attention focused on one half of the operator,
with the tacit or explicit assumption that identical
interactions occur in each half of the complex. Departures from
palindromic symmetry allow proteins to distinguish among
multiple related operators (SADL83, SIMO84). One must view the
DNA-protein interface as a whole. The emphasis in the
literature on dyad symmetry is a barrier to determining the
requirements for general novel recognition of DNA by proteins.
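
As an illustration of what "approximately palindromic" means for an operator, the following minimal Python sketch counts the positions at which a DNA sequence differs from its own reverse complement. The function names and the example sequences are assumptions made for illustration only and are not taken from the cited literature.

```python
# Illustrative sketch only: names and example sequences are hypothetical.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def dyad_mismatches(seq):
    """Count positions where seq differs from its reverse complement."""
    rc = reverse_complement(seq)
    return sum(1 for a, b in zip(seq, rc) if a != b)


print(dyad_mismatches("GAATTC"))  # a perfectly dyad-symmetric site
print(dyad_mismatches("GAATTA"))  # one base changed in one half-site
```

A single base change in one half-site shows up as two mismatched positions, because each position is compared with its symmetry mate. A sequence of odd length always scores at least one mismatch, since the central base cannot equal its own complement.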






The equilibrium geometry and flexibility of DNA are determined
by the sequence; see inter alia HOGA87, GART88, and ULAN87. The
interactions of ionic, polar, and hydrophobic groups on the DNA
with solvent molecules and ions make detailed predictions of
DNA conformation and binding properties very difficult; cf.
OHLE85, ULAN87, and OTWI88.

Matthews (MATT88), commenting on the current collection of
protein-DNA structures, concludes that: a) different H-T-H DBPs
use their recognition helices differently, b) there is no
simple code that relates particular base pairs to particular
amino acids at specific locations in the DBP, and c) "full
appreciation of the complexity and individuality of each
complex will be discouraging to anyone hoping to find simple
answers to the recognition problem." Schleif (SCHL88) has
characterized the study of DNA-binding proteins as a field
still in its infancy and emphasizes the difficulties of
designing proteins that bind predetermined sequences.

Prokaryotic repressors exist that are unrelated to H-T-H
binding proteins. Some of these bind to approximate palindromic
sequences (e.g., Salmonella typhimurium phage P22 Mnt protein
(VERS87a) and E. coli TyrR repressor protein (DEFE86)). Others
bind to operator sequences that are partially symmetric (S.
typhimurium phage P22 Arc protein, VERS87b; E. coli Fur
protein, DELO87; plasmid R6K pi protein, FILU85) or
non-symmetric (phage Mu repressor, KRAU86).

Genetics has enabled extensive analysis of prokaryotic
DNA-binding proteins and their specific nucleic acid
recognition sequences. It is not yet possible, however, to
design a protein to bind strongly and specifically to an
arbitrary DNA sequence. As taught by the present invention it
is, nonetheless, possible to postulate a family of potential
DBP mutants and identify one having the desired specificity by
other means.

Genetic studies of the DNA-binding proteins show that mutations
in protein sequence that result in decrease of protein function
fall into two overlapping classes: 1) those that destabilize
the global protein structure or folding and 2) those that
specifically alter the binding properties. The first class
illuminates the general problem of protein folding and
stability, while the second defines the interactions involved
in the formation and stabilization of the protein-DNA complex.
Mutations in the operator yield additional information.

Positions 84 to 91 in helix 5 of λ repressor have been
subjected to extensive amino acid substitutions (REID88). Two
or three positions were varied simultaneously through all
twenty amino acids and those combinations giving normal
function were selected. The authors did not discuss
optimization of the number or positions of residues to vary to
obtain any particular functionality, nor did they attempt to
obtain proteins having alternate dimerization or recognition
functions.

Pakula et al. (PAKU86) have randomly mutagenized λ Cro. They
sought and found non-functional mutants but did not seek or
find proteins having novel DNA-binding properties, nor did they
suggest how to select such proteins.

Sequence-independent DNA-protein interactions are thought to
occur via electrostatic interactions between the backbone of
the DNA and charged or polar groups of the protein (ANDE87,
LEWI83, and TAKE85). Sequence-specific interactions involve
H-bonding, nonpolar, or van der Waals contacts between exposed
side groups or groups of the polypeptide main chain and base
pair edges exposed in the major and minor grooves of the DNA.

Mutations that alter residues involved in specific binding
interactions with DNA have been identified in prokaryotic DBPs,
including λ, 434, and P22 repressor and Cro proteins, P22 Arc
and Mnt, and E. coli trp and lac repressors and CAP. These
mutations occur in residues that are exposed to solvent in the
free protein but buried in the protein-DNA complex.

A few cases have been reported (BASS88, YOUD83, VERS85a,
CARU87, WHAR85b, WHAR87, EBRI84, and SPIR88) in which a change
in one or a few residues in a DNA-binding protein not only
abolishes binding by the protein to the wild-type operator but
also confers strong binding to a different operator. In all the
cited publications, alteration of binding specificity has been
accomplished by using symmetrically located pairs of
alterations in the operator sites. Single, asymmetric changes
or multiple changes asymmetrically located in either the
binding protein or its operator were not considered. In "helix
swap" experiments (WHAR84, WHAR85b, WHAR85a, SPIR88, BUSH88,
PABO84), multiple mutations are introduced into the DNA-binding
recognition helix of H-T-H proteins with the goal of changing
the operator specificity of one known DBP to that of a
different known DBP.

An extension of the "helix swap" experiments uses a mixture of
434 repressor and 434R[alpha3(P22R)] (HOLL88). This mixture
recognizes and binds in vitro with high affinity to a 16 bp
chimeric operator comprising a 434 half-site and a P22
half-site, indicating that active heterodimers are formed. The
authors did not extend the results to intracellular repression,
nor did they perform mutagenesis of the repressors and
selection of cells to create novel recognition patterns.

Two approaches have been developed to create novel proteins
through reverse genetics. In one approach, dubbed "protein
surgery" (DILL87), a substitution is introduced at a single
protein residue. This approach has been used to determine the
effects on structure and function of specific substitutions in
trypsin (CRAI85, RAOS87, BASH87).

The other approach has been to generate a variety of mutants at
many loci within the cloned gene, the "gene-directed random
mutagenesis" method. The specific location and nature of the
change or changes are determined post hoc by DNA sequencing. If
loss of a wild-type function confers a cellular phenotype, one
screens colonies for mutations; see, e.g., PAKU86. This
approach is limited by the number of colonies that can be
examined. An additional important limitation is that many
desirable protein alterations require multiple amino acid
substitutions and thus are not accessible through single base
changes or even through all possible amino acid substitutions
at any one residue.

The objective in both these approaches has been, however, to
analyze the effects of a variety of point mutations, so that
rules governing such substitutions could be developed (ULME83).
Progress has been hampered by the efforts involved in using
either method (ROBE86).

Oliphant et al. (OLIP86) and Oliphant and Struhl (OLIP87) have
demonstrated ligation and cloning of highly degenerate
oligonucleotides and have applied saturation mutagenesis to the
study of promoter sequence and function. They have suggested
that similar methods could be used to study genetic expression
of protein coding regions of genes, but they do not say how one
should: a) choose protein residues to vary, or b) select or
screen mutants with desirable properties.

Ward et al. (WARD86) have engineered heterodimers from
homodimers of tyrosyl-tRNA synthetase. Methods of converting
homodimeric DBPs into heterodimeric DBPs are disclosed in the
present invention. Methods of deriving single-polypeptide
pseudo-dimeric DBPs from homodimeric DBPs are disclosed in the
examples of the present invention.

Benson et al. (BENS86) have developed a scheme to detect genes
for sequence-specific DNA-binding proteins. They do not
consider non-symmetric target DNA sequences, nor do they
suggest mutagenesis to generate novel DNA-binding properties.
Their method is presented as a method to detect genes for
naturally occurring DNA-binding proteins. Because the selective
system is lytic growth of phage, low levels of repression
cannot be detected. Selective chemicals, as disclosed in the
present application, on the other hand, can be finely modulated
so that low-level repression is detectable.






Elledge and Davis (ELLE89a) and Elledge et al. (ELLE89b) have
used an occluded aadA gene in a selection for cells expressing
eukaryotic DBPs. The supposed recognition sequence of the
sought DBP is incorporated into the strong promoter that
occludes aadA on a low-copy-number plasmid. Their system is
presented as a tool for cloning pre-existing DBPs and there is
no mention of variegation of the gene that encodes the
potential DBP. Furthermore, there is no discussion of the
symmetry of the target sequence or of the symmetry of the DBP.

Ladner and Bird, WO88/06601, suggest strategies for the
preparation of asymmetric repressors. In one embodiment, a gene
is constructed that encodes, as a single polypeptide chain, the
two DNA-binding domains of a naturally occurring dimeric
repressor, joined by a polypeptide linker that holds the two
binding domains in the necessary spatial relationship for
binding to an operator. While they prefer to design the linker
based on protein structural data (cf. Ladner, U.S. Patent
4,704,692), they state that uncertainties in the design of the
linker may be resolved by generating a family of synthetic
genes, differing in the linker-encoding subsequence, and
selecting in vivo for a gene encoding the desired pseudo-dimer.
Ladner and Bird do not consider the background of false
positives that would arise if the two-domain polypeptides
dimerize to form pseudo-tetramers.

The binding of lambdoid repressors, Cro and CI repressor, is
taken, in WO88/06601, as canonical even though other DBPs were
known having operators of different lengths. WO88/06601
maintains that the 17 bp lambdoid operators can be divided into
three regions: a) a left arm of five bases, b) a central region
of seven bases, and c) a right arm of five bases. Several other
DBPs are known for which this division is inappropriate.
Further, WO88/06601 states that the sequence and composition of
the central region, in which edges of bases are not contacted
by the DBP, are immaterial. There is direct evidence for 434
repressor (KOUD87, KOUD88) that the sequence and composition of
the central region strongly influence binding of 434 repressor.

Once a pseudo-dimer is obtained, they then obtain an asymmetric
pseudo-dimer by the following technique. First, the user of
WO88/06601 is directed to construct a family of hybrid
operators in which the sequences of the left and right arms are
specified; no specification is given for the central seven
bases. In each member of the family, the left arm contains the
same sequence as the wild-type operator left arm while the
right arm 5-mer is systematically varied through all 1024
possibilities. Similarly, in the gene encoding the pseudodimer,
the codons for one recognition helix have the wild-type
sequence while the codons coding for the other recognition
helix are highly varied. The variegated pseudodimer genes are
expressed in bacterial cells, wherein the hybrid operators are
positioned to repress a single highly deleterious gene. Thus,
it is supposed that one can identify a recognition helix for
each possible 5-mer right arm of the operator by in vivo
selection; the correspondences between 5-mer right arms and
sequences of recognition helices are compiled into a
dictionary. The consequences of mutations or deletions in the
deleterious genes are not considered. WO88/06601 suggests that
successful constructions may be very rare, e.g., one in 10^6,
but ignores other genetic events of similar or greater
frequency.
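
The figure of 1024 possible right arms follows from four bases at each of five positions (4^5 = 1024). A minimal Python sketch that enumerates them is given below; the function name `all_kmers` is a hypothetical helper introduced only for this illustration.

```python
from itertools import product

BASES = "ACGT"


def all_kmers(k):
    """Return every DNA sequence of length k over the four bases."""
    return ["".join(p) for p in product(BASES, repeat=k)]


right_arms = all_kmers(5)
print(len(right_arms))  # number of distinct 5-mer right arms
```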

To obtain a repressor for an arbitrary 17-mer operator, the
user of WO88/06601:

a) finds the 5-mer sequence of the left arm in the dictionary
and uses the corresponding recognition helix sequence in the
first DNA-binding domain of the pseudodimer,

b) ignores the sequence and composition of the next seven
bases, and

c) finds the 5-mer sequence of the right arm in the dictionary
and uses the corresponding recognition helix sequence in the
second DNA-binding domain of the pseudodimer.
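
The three-step lookup attributed to WO88/06601 can be sketched in Python as follows. This is a toy illustration under stated assumptions: the function names, the example operator, and the dictionary entries ("helix-1", "helix-2") are hypothetical placeholders, not data from WO88/06601, and the sketch follows the description literally without modeling the orientation of either arm.

```python
def split_operator(operator):
    """Split a 17-base operator into left arm (5), center (7), right arm (5)."""
    if len(operator) != 17:
        raise ValueError("expected a 17-base operator")
    return operator[:5], operator[5:12], operator[12:]


def lookup_helices(operator, dictionary):
    """Look up a recognition helix for each arm; the center is ignored."""
    left, _central, right = split_operator(operator)
    return dictionary[left], dictionary[right]


# Hypothetical placeholder entries; a real dictionary would hold
# recognition-helix sequences for each 5-mer.
toy_dictionary = {"ACGTA": "helix-1", "TTGCA": "helix-2"}
operator = "ACGTA" + "GGGGGGG" + "TTGCA"
print(lookup_helices(operator, toy_dictionary))
```

Because the central seven bases never enter the lookup, two operators that differ only in the center map to the same pair of helices, which is the assumption the evidence for 434 repressor calls into question.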



WO88/06601 also envisions means for producing a heterodimeric
repressor. A plasmid is provided that carries genes encoding
two different repressors. A population of such plasmids is
generated in which some codons are varied in each gene.
WO88/06601 instructs the user to introduce very high levels of
variegation without regard to the number of independent
transformants that can be produced. WO88/06601 also instructs
the user to introduce variegation at widely separated sites in
the gene, though there is no teaching concerning ways to
simultaneously introduce high levels of variegation at widely
separated sites in the gene or concerning maintenance of
diversity without selective pressure, as would be needed if the
variegation were introduced stepwise. WO88/06601 teaches that
codons thought to be involved in the protein-protein interface
should be preferentially mutated to generate heterodimers.
Cells transformed with this population of plasmids will produce
both the desired heterodimer and the two "wild-type"
homodimers. WO88/06601 advises that one select for production
of the heterodimer by providing a highly deleterious gene
controlled by a hybrid operator, and beneficial genes
controlled by the wild-type operators. The fastest growing
cells, it is taught, will be those that produce a great deal of
the heterodimer (which blocks expression of the deleterious
gene), and little of the homodimers (so that the beneficial
genes are more fully expressed). There is no consideration of
mutations or deletions in the deleterious gene or in the
wild-type operators; such mutations will produce a background
of fast-growing cells that do not contain the desired
heterodimers.



SUMMARY OF THE INVENTION

This invention relates to the development of novel proteins or
polypeptides that preferentially bind to a specific subsequence
of double-stranded DNA (the "target"), which need not be
symmetric, using a novel scheme for in vivo selection of mutant
proteins exhibiting the desired binding specificities.

The novel binding proteins or polypeptides may be obtained by
mutating a gene encoding on expression: 1) a known DNA-binding
protein within the subsequence encoding a known DNA-binding
domain, 2) a protein that, while not possessing a known
DNA-binding activity, possesses a secondary or higher order
structure that lends itself to binding activity (clefts,
grooves, helices, etc.), 3) a known DNA-binding protein but not
in the subsequence known to cause the binding, or 4) a
polypeptide having no known 3D structure of its own.



This application uses the term "variegated DNA" to refer to a
population of molecules that have the same base sequence
through most of their length, but that vary at a number of
defined loci. Using standard genetic engineering techniques,
variegated DNA can be introduced into a plasmid so that it
constitutes part of a gene (OLIP86, OLIP87, CHEN88, AUSU87,
REID88). When plasmids containing variegated DNA are used to
transform bacteria, each cell makes a version of the original
protein. Each colony of bacteria produces a different version
from most other colonies. If the variegations of the DNA are
concentrated at loci that code on expression for residues known
to be on the surface of the protein or in loops, a population
of genes will be generated that code on expression for a
population of proteins, many members of which will fold into
roughly the same 3D structure as the parental protein. Most
often we generate mutations that are concentrated within codons
for residues thought to make contact with the DNA. Secondarily,
we introduce mutations into codons specifying residues that are
not directly involved in DNA contact but that affect the
position or dynamics of residues that do contact the DNA.

In general, a variegated population of DNA molecules, each of
which encodes one of a large (e.g., 10^7) number of distinct
potential target-binding proteins, is used to transform a cell
culture. The cells of this cell culture are engineered with
binding marker genetic elements so that, under selective
conditions, the cell thrives only if the expressed potential
target-binding protein in fact binds to the target subsequence,
preventing transcription of these binding marker genetic
elements. (Typically, binding of a successful target-binding
protein to the target subsequence blocks expression of a gene
product that is deleterious under selective conditions.
Alternatively, binding of a successful target-binding protein
can inactivate a strong promoter that otherwise occludes
transcription of a beneficial gene.) The mutant cells are
directed to express the potential target-binding proteins and
the selective conditions are applied. Cells expressing proteins
binding successfully to the target are thus identified by in
vivo selection. If the binding characteristics are not fully
satisfactory, the amino acid sequences of the best binding
proteins are determined (usually by sequencing the
corresponding genes), a new population of DNA molecules is
synthesized that encode variegated forms of the best binding
proteins of the last cull, mutant cells are prepared, the new
population of potential DNA-binding proteins is expressed, and
the best proteins are once again identified by the superior
growth of the corresponding transformants under selective
conditions. The process is repeated until a protein or
polypeptide with the desired binding characteristics is
obtained. Its corresponding gene may then be moved to a
suitable expression system.



In the simplest form of this invention, the mutant cells are
provided with a selectable genetic element, the transcription
of which is deleterious to the survival or growth of the cell.
The selectable genetic element either is a promoter or is
operably linked to a promoter regulating the expression of the
gene. The promoter, or other non-coding region of the genetic
element (for example, an intron), has been modified to include
the desired target subsequence in a position where it will not
interfere with transcription of the selectable gene unless a
protein binds to that target subsequence. Each mutant cell is
also provided with a gene encoding on expression a potential
DNA-binding protein, operably linked to a promoter that is
preferably regulated by a chemical inducer. When this gene is
expressed, the potential DNA-binding protein has the
opportunity to bind to the target and thereby protect the cell
from the selective conditions under which the product of the
binding marker gene would otherwise harm the cell.

In addition to the desired outcome of these in vivo selections,
there exist a number of possible genetic events that allow the
cells to escape the selection, producing artifacts and
inefficiency by allowing the growth of colonies that do not
express the desired sequence-specific DNA-binding proteins.
Examples of mechanisms, other than the desired outcome, that
lead to cell survival under the selective conditions include:
a) a point mutation or a deletion in the selectable gene
eliminates expression or function of the selectable gene
product; b) a host chromosomal mutation compensates for or
suppresses function of the selectable gene product; c) the
introduced potential DNA-binding protein binds to a DNA
subsequence other than the chosen target subsequence and blocks
expression of the selectable gene; d) the introduced potential
DNA-binding protein binds to and inactivates the gene product
of the selective gene; and e) a DNA-binding protein endogenous
to the host mutates so that it binds to the selectable gene and
blocks expression of the selectable gene.

This invention relates, in particular, to the design of a
vector that confers upon the host cells the desired conditional
sensitivity to the selection conditions in such a manner as to
greatly reduce the likelihood of false positives and
artifactual colonies.

First, at least two selectable genes that are functionally
unrelated are used to reduce the risk that a single point
mutation in the vector (or in the host chromosome) will destroy
the sensitivity of the cell to the selective conditions, since
it will eliminate only one of the two (or more) deleterious
phenotypes. Similarly, a single introduced gene for a potential
DNA-binding protein that binds to and inactivates the gene
product of one selectable gene will not bind and inactivate the
gene product of the other selectable gene. The likelihood that
point mutations will occur in both selectable genes, or that
two host chromosomal mutations will spontaneously arise that
suppress the effects of two genes, is the product of the
individual probabilities of the necessary events, and thus is
extremely low.
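
The product rule invoked here can be made concrete with a short Python sketch. The per-gene escape probability used below (one in a million) is a hypothetical value chosen only for illustration and is not a measured mutation rate from this application; the calculation also assumes the two events are independent.

```python
import math


def joint_escape_probability(per_gene_probabilities):
    """Probability that independent escape events occur in every gene."""
    return math.prod(per_gene_probabilities)


# Hypothetical per-gene probability of an inactivating mutation.
p_single = 1e-6
p_double = joint_escape_probability([p_single, p_single])
print(p_double)  # far smaller than p_single
```

With two functionally unrelated selectable genes, a cell must suffer both events to escape, so the expected background falls from the single-event probability to its square under these assumptions.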

The DNA sequences of the two or more selectable genes
preferably should not have long segments of identity: a) to
avoid isolation of a DBP that binds these identical regions
instead of the intended target sequence, and b) to reduce the
likelihood of genetic recombination. The degeneracy of the
genetic code allows us to avoid exact identity of more than a
few, e.g. 10, bases.
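
One way to check a pair of candidate gene sequences against such a limit is to compute the longest run of bases they share exactly. The Python sketch below does this with a standard dynamic-programming scan; the function name and the short example sequences are hypothetical, and the sketch compares only the given strands (a fuller check might also compare against the reverse complement).

```python
def longest_shared_run(a, b):
    """Length of the longest exact substring common to sequences a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous base of each.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


MAX_IDENTITY = 10  # the example limit mentioned in the text
gene_a = "AAACGTACGTTT"  # hypothetical fragments
gene_b = "GGACGTACGG"
run = longest_shared_run(gene_a, gene_b)
print(run, run <= MAX_IDENTITY)
```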

Second, the selectable genes are placed on the vector in
alternation with genetic elements that are essential to plasmid
maintenance. Thus, a single deletion event, even of thousands
of bases, cannot eliminate both selectable genes without also
eliminating vital genetic elements. Alternatively, the
selectable genes are placed in the bacterial chromosome.
Spontaneous deletions from the chromosome are rare.
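
The benefit of alternating selectable and essential elements can be checked with a small Python model. This is a sketch under simplifying assumptions: the plasmid is treated as a circular list of named elements, a single deletion removes one contiguous arc of whole elements, and the element names ("sel1", "ess1", and so on) are hypothetical labels.

```python
def contiguous_arcs(n):
    """Yield every proper contiguous arc of indices on a circle of size n."""
    for start in range(n):
        for length in range(1, n):
            yield {(start + k) % n for k in range(length)}


def single_deletion_escape(layout):
    """True if one deletion can remove all selectable genes while
    leaving every essential element intact."""
    selectable = {i for i, e in enumerate(layout) if e.startswith("sel")}
    essential = {i for i, e in enumerate(layout) if e.startswith("ess")}
    return any(
        selectable <= arc and not (essential & arc)
        for arc in contiguous_arcs(len(layout))
    )


alternating = ["sel1", "ess1", "sel2", "ess2"]
contiguous = ["sel1", "sel2", "ess1", "ess2"]
print(single_deletion_escape(alternating))
print(single_deletion_escape(contiguous))
```

In the alternating layout every arc that spans both selectable genes also spans an essential element, so no single deletion yields a viable escape; in the contiguous layout one short deletion suffices.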

Third, different promoters are associated with each of the
selectable genes. This ensures that the selection does not
isolate cells harboring genes encoding on expression novel
DNA-binding proteins that bind specifically to subsequences
that are part of the promoter but not the chosen target
subsequence. Each cell expresses only one or a few introduced
potential DNA-binding proteins (multiple potential DNA-binding
proteins could arise if one cell is transformed by two or more
variegated plasmids). The probability that two such proteins
will occur in one cell and that one will bind to the promoter
of the first selectable gene and that the second will bind to
the different promoter of the second selectable gene is very
small.






Fourth, the selectable binding marker genes may be placed on a
vector different from the vector that carries the potential dbp
gene. DNA manipulations that introduce variegation into the
potential dbp gene can cause mutations in the vector remote
from the site of the intended mutations. Thus, we may place the
selectable binding marker genes in the bacterial chromosome or
on a separate plasmid that is compatible with the dbp vector.

Finally, the same promoter is used to initiate transcription of
two genes: a) one of the deleterious selectable binding marker
genes, and b) a beneficial or essential gene also borne on the
plasmid and used to select for uptake and maintenance of the
plasmid (e.g., an antibiotic resistance gene, such as bla). In
the case of the beneficial or essential gene, however, there is
no instance of the predetermined target DNA subsequence
associated with the promoter. Thus, if a DNA-binding protein
binds to a subsequence of the promoter other than the
predetermined target DNA subsequence, it will frustrate
expression of the beneficial or essential gene. If desired,
more than one such beneficial or essential gene may be
provided. In that event, copies of promoter A may be operably
linked to both deleterious gene A' (with an instance of the
target) and beneficial gene A" (without an instance of the
target), while copies of promoter B are operably linked to both
deleterious gene B' (with target) and beneficial gene B"
(without target).

The selection system described above is a powerful tool that
eliminates most of the artifacts associated with selections
based on cloning vectors that use a single selectable gene or
that have all selectable genes in a contiguous region of the
plasmid. While this invention embraces using the aforementioned
elements of a selection system singly or in partial
combination, most preferably all are employed.

In one embodiment, the invention relates to a cell culture
comprising a plurality of cells, each cell bearing:

i) a gene coding on expression for a potential DNA-binding
protein or polypeptide, where such protein or polypeptide is
not the same for all such cells, but rather varies at a limited
number of amino acid positions; and

ii) at least two independent operons, each comprising at least
one binding marker gene coding on expression for a product
conditionally deleterious to the survival or reproduction of
such cells, the promoter of each said binding marker gene
containing a predetermined target DNA subsequence so positioned
that, if said target DNA subsequence is bound by a DNA-binding
protein or polypeptide, said conditionally deleterious product
is not expressed in functional form.



WO 90/07862    PCT/US90/00024

Most known DNA-binding proteins bind to palindromic or nearly
palindromic operators. It is desirable to be able to obtain a
protein or polypeptide that binds to a target DNA subsequence
having no particular sequence symmetry. In another embodiment
of the present invention, such a binding protein is obtained by
creating a hybrid of two dimeric DNA-binding proteins, one of
which (DBP_L) recognizes a symmetrized form of the left
subsequence of the target subsequence, and the other of which
(DBP_R) recognizes a symmetrized form of the right subsequence
of the target subsequence.



Cells producing equimolar mixtures of DBP_L and DBP_R contain
approximately 1 part (DBP_L)_2, 2 parts DBP_L:DBP_R, and 1 part
(DBP_R)_2. The DBP_L:DBP_R heterodimers, which bind to the
non-symmetric target subsequence, may be isolated from a cell
lysate by affinity chromatography using the target sequence as
the ligand. If desired, the heterodimers may be stabilized by
chemically crosslinking the two binding domains.
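
The approximate 1:2:1 distribution is what random pairing of two monomer types predicts. The Python sketch below works this out under two stated assumptions that go beyond the text: the monomers associate at random, and the three dimers are equally stable. The function name and the labels "LL", "LR", and "RR" are hypothetical.

```python
from fractions import Fraction


def dimer_fractions(f_left):
    """Expected dimer fractions when a fraction f_left of monomers is
    the left-recognizing type and pairing is random."""
    f_right = 1 - f_left
    return {
        "LL": f_left ** 2,          # left homodimer
        "LR": 2 * f_left * f_right,  # heterodimer (either order)
        "RR": f_right ** 2,         # right homodimer
    }


mix = dimer_fractions(Fraction(1, 2))  # equimolar mixture
print(mix)
```

For an equimolar mixture the heterodimer is half of all dimers; any departure from equimolar expression lowers that share, which is one motivation for engineering complementary interfaces that favor the heterodimer.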



It is also possible to modify both DBP_L and DBP_R, by a
process of variegation and selection, so that they have
(without disturbing their affinity for the predetermined DNA
target subsequence) complementary but not dyad-symmetric
protein-protein binding surfaces. When such polypeptides are
mixed, in vivo or in vitro, the primary species will be
DBP_L:DBP_R heterodimers. Alternatively, reversing the steps, a
dimeric binding protein may be modified so that its two binding
domains have complementary but not dyad-symmetric
protein-protein binding surfaces, and then the DNA-contacting
surfaces are modified to bind to the right and left halves of
the target DNA subsequence. In either case, the resulting
cooperative domains can be crosslinked for increased stability.

When a binding protein is engineered so that its two binding
domains have complementary, but not dyad-symmetric,
protein-protein binding surfaces, then in the preferred
embodiment one of the steps will be a "reverse selection",
i.e., a selection for a protein that does not bind to the
symmetrized half-target sequence. To facilitate such reverse
selection, it is desirable that the binding marker genes be
capable of "two-way" selection (VINO87). For a two-way
selectable gene there exist both a first selection condition in
which the gene product is deleterious (preferably lethal) to
the cell and a second selection condition in which the gene
product is beneficial (preferably essential) to the cell. The
first selection condition is used for forward selection, in
which we select for cells expressing proteins that bind to the
target so that gene expression is repressed. The second
selection condition is used for reverse selection, in which we
select for cells that do not express a protein that binds to
the target, thereby allowing expression of the gene product.

Abolition of function is much easier than engineering
of novel function. Reverse selection can isolate cells
that: a) express no DBP, b) express unstable proteins
descendant from a parental DBP, or c) express a protein
descendant from a parental DBP having very nearly the same
3D structure as the parental DBP, but lacking the
functionality of the parent. We are interested in this
third class. It is difficult, however, to distinguish among
these classes genetically. Therefore, when using reverse
selection, we carefully choose sites to mutate the protein
(so as to minimize the chances of destroying tertiary
structure) and we introduce a lower level of variegation
than in forward selection. We must verify biochemically
that a stable, folded protein is produced by the isolated
cells.

Another concept of the present invention is the use of
a polypeptide, rather than a protein, to preferentially
bind DNA. This polypeptide, instead of binding the DNA
molecule as a preformed molecule having shape complementary
to DNA, will wind about the DNA molecule in the major or
minor groove. Such a polypeptide has the advantages that:
a) it is smaller than a protein having equivalent
recognizing ability and may be easier to introduce into
cells, and b) it may serve as a model for creation of other
compounds that bind DNA sequence-specifically.

In a preferred embodiment, transcription of the DNA
that codes on expression for potential DNA-binding proteins
or polypeptides is regulated by addition of a chemical
inducer, such as isopropylthiogalactoside (IPTG), to the
cell culture. Other regulatable promoters having different
inducers or other means of regulation are also
appropriate.



The invention encompasses the design and synthesis of
variegated DNA encoding on expression a collection of
closely related potential DNA-binding proteins or
polypeptides characterized by constant and variable
regions, said proteins or polypeptides being designed with
a view toward obtaining a protein or polypeptide that binds
a predetermined target DNA subsequence.

For the purposes of this invention, the term "potential
DNA-binding polypeptide" refers to a polypeptide
encoded by one species of DNA molecule in a population of
variegated DNA wherein the region of variation appears in
one or more subsequences encoding one or more segments of
the polypeptide having the potential of serving as a
DNA-binding domain for the target DNA sequence or having
the potential to alter the position or dynamics of protein
residues that contact the DNA. A "potential DNA-binding
protein" (potential-DBP) may comprise one or more potential
DNA-binding polypeptides. Potential-DBPs comprising two or
more polypeptide chains may be homologous aggregates (e.g.,
A2) or heterologous aggregates (e.g., AB).

From time to time, it may be helpful to speak of the
"parental sequence" of the variegated DNA. When the novel
DNA-binding domain sought is a homolog of a known
DNA-binding domain, the parental sequence is the sequence
that encodes the known DNA-binding domain. The variegated
DNA is identical with this parental sequence at most loci,
but will diverge from it at chosen loci. When a potential
DNA-binding domain is designed from first principles, the
parental sequence is a sequence that encodes the amino acid
sequence that has been predicted to form the desired
DNA-binding domain, and the variegated DNA is a population
of "daughter DNAs" that are related to that parent by a
high degree of sequence similarity.

The fundamental principle of the invention is one of
forced evolution. The efficiency of the forced evolution
is greatly enhanced by careful choice of which residues are
to be varied. The 3D structure of the potential
DNA-binding domain and the 3D structure of the target DNA
sequence are key determinants in this choice. First, a set
of residues is identified that can either simultaneously
contact the target DNA sequence or affect the orientation
or flexibility of residues that can touch the target. Then
all or some of the codons encoding these residues are
varied simultaneously to produce a variegated population of
DNA. The variegated population of DNA is introduced into
cells so that a variegated population of cells producing
various potential-DBPs is obtained.

The highly variegated population of cells containing
genes encoding potential-DBPs is selected for cells
containing genes that express proteins that bind to the
target DNA sequence ("successful DNA-binding proteins").
After one or more rounds of such selection, one or more of
the chosen genes are examined and sequenced. If desired,
new loci of variation are chosen. The selected daughter
genes of one generation then become the parental sequences
for the next generation of variegated DNA (vgDNA).

DNA-binding proteins (DBPs) that bind specifically to
viral DNA so that transcription is blocked will be useful
in treating viral diseases, either by introducing DBPs into
cells or by introducing the gene coding on expression for
the DBP into cells and causing the gene to be expressed.
In order to develop such DBPs, we need use only the
nucleotide sequence of the viral genes to be repressed.
Once a DBP is developed, it is tested against virus in
vivo. Use of several independently-acting DBPs that all
bind to one gene allows us to: a) repress the gene despite
possible variation in the sequence, and b) focus
repression on the target gene while distributing side
effects over the entire genome of the host cell. Animals,
plants, fungi, and microbes can be genetically made
intracellularly immune to viruses by introducing, into the
germ line, genes that code on expression for DBPs that bind
DNA sequences found in viruses that infect the animal
(including human), plant, fungus, or microbe to be
protected.

Sequence-specific DBPs may also be used to treat
autoimmune and genetic disease either by repressing
noxious genes or by causing expression of beneficial
genes.

Some naturally-occurring DBPs bind sequence-specifically
to DNA only in the presence or absence of specific
effector molecules. For example, Lac repressor does not
bind the lac operator in the presence of lactose or
isopropylthiogalactoside (IPTG); Trp repressor binds DNA
only in the presence of tryptophan or certain analogues of
tryptophan. The method of the present invention can be
used to select mutants of such DBPs that a) recognize a
different cognate DNA sequence, or b) recognize a different
effector molecule. These alterations would be useful
because: a) novel DBPs derived from known inducible or
de-repressible DBPs can be used without affecting existing
metabolic pathways, and b) having novel effectors allows us
to induce or de-repress the regulated gene without altering
the state of genes that are controlled by the natural
effectors. In addition, temperature-sensitive DBPs could be
made which would allow us to control gene expression in the
same way that λ cI857 and PR and PL are used.

Conferring novel DNA-recognition properties on
proteins will allow development of novel restriction
enzymes that recognize more base pairs and therefore cut
DNA less frequently. For example, the methods of the
present invention will be useful in developing a derivative
of EcoRI (recognition GAATTC) that recognizes and cleaves a
longer recognition site, such as TGAATTCA. Proteins that
recognize specific DNA sequences may also be used to block
the action of known restriction enzymes at some subset of
the recognition sites of the known enzyme, thereby
conferring greater specificity on that enzyme. Other
DNA-binding enzymes may also be obtained by the methods
described herein.

The methods of the present invention are primarily
designed to select from a highly variegated population
those cells that contain genes that code on expression for
proteins that bind sequence-specifically to predetermined
DNA sequences. The genetic constructions employed can also
be used as an assay for putative DBPs that are obtained in
other ways.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1     Schematic of protein bound to DNA.
Figure 2     Schematic of evolution of a binding protein.
Figure 3     Plasmid pKK175-6.
Figure 4     Plasmid pAA3H.
Figure 5     Summary of construction of pEP1009.
Figure 6     Plasmid pEP1001.
Figure 7     Plasmid pEP1002.
Figure 8     Plasmid pEP1003.
Figure 9     Plasmid pEP1004.
Figure 10    Plasmid pEP1005.
Figure 11    Plasmid pEP1007.
Figure 12    Plasmid pEP1009.

DETAILED DESCRIPTION OF THE INVENTION AND ITS PREFERRED
EMBODIMENTS

Abbreviations:

The following abbreviations will be used throughout
the present invention:

Abbreviation    Meaning

DBP             DNA-binding protein
idbp            A gene encoding the initial DBP
pdbp            A gene encoding a potential-DBP
vgDNA           Variegated DNA
dsDNA           Double-stranded DNA
ssDNA           Single-stranded DNA
TcR, TcS        Tetracycline resistance or sensitivity
GalR, GalS      Galactose resistance or sensitivity
Gal+, Gal-      Ability or inability to utilize galactose
FusR, FusS      Fusaric acid resistance or sensitivity
KmR, KmS        Kanamycin resistance or sensitivity
ApR, ApS        Ampicillin resistance or sensitivity

Terminology

A domain of a protein that is required for the
protein to specifically bind a chosen DNA target
subsequence is referred to herein as a "DNA-binding
domain". A protein may comprise one or more domains, each
composed of one or more polypeptide chains. A protein that
binds a DNA sequence specifically is denoted as a
"DNA-binding protein". In one embodiment of the present
invention, a preliminary operation is performed to obtain a
stable protein, denoted as an "initial DBP", that binds one
specific DNA sequence. The present invention is concerned
with the expression of numerous, diverse, variant
"potential-DBPs", all related to a "parental potential-DBP"
such as a known DNA-binding protein, and with selection and
amplification of the genes encoding the most successful
mutant potential-DBPs. An initial DBP is chosen as
parental potential-DBP for the first round of variegation.
Selection isolates one or more "successful DBPs". A
successful DBP from one round of variegation and selection
is chosen to be the parental DBP to the next round. The
invention is not, however, limited to proteins with a
single DNA-binding domain, since the method may be applied
to any or all of the DNA-binding domains of the protein,
sequentially or simultaneously.

Amino acids are indicated by the single-letter code
(AUSU87, Appendix A).

Symbols that represent ambiguous DNA are: T, C, A, G
for themselves; M for A or C; R for A or G; W for A or T;
S for C or G; Y for T or C; K for G or T; V for A, C, or G;
H for A, C, or T; D for A, G, or T; B for C, G, or T; N for
any base.
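
To illustrate how these symbols describe variegated DNA, the sketch below expands an ambiguous sequence into every unambiguous sequence it stands for. The symbol table is taken from the list above; the function name and the example codons `NNK` and `RY` are illustrative choices, not taken from this specification.

```python
from itertools import product

# Ambiguity symbols as defined in the text above.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

def expand(ambiguous):
    """Return all unambiguous DNA sequences matching an ambiguous one."""
    choices = [AMBIGUITY[symbol] for symbol in ambiguous.upper()]
    return ["".join(bases) for bases in product(*choices)]

nnk_codons = expand("NNK")  # hypothetical variegated codon
print(len(nnk_codons))
print(expand("RY"))
```

The number of DNA sequences represented grows as the product of the number of bases allowed at each position, which is why the choice of varied positions matters.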
Conventionally, DNA sequences are written from 5' to
3', left-to-right.

    anti-sense DNA:  5' ATG CTT TTC ... 3'
    sense DNA:       3' TAC GAA AAG ... 5'

    mRNA:            5' AUG CUU UUC ... 3'
    protein:             M - L - F - ...

We will use the convention that the "sense" strand is the
strand used as template for mRNA synthesis.
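
The relationship among the strands in this convention can be checked with a short sketch. It assumes the standard genetic code; all names are illustrative. Note that, under the convention stated above, the template ("sense") strand is the base-by-base complement of the anti-sense strand and is written 3' to 5' beneath it.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def template_strand(antisense_5to3):
    """Template ('sense' in this document) strand, written 3'->5'."""
    return antisense_5to3.translate(COMPLEMENT)

def transcribe(antisense_5to3):
    """mRNA has the anti-sense strand's sequence with U in place of T."""
    return antisense_5to3.replace("T", "U")

def translate(mrna):
    """Translate complete codons of an mRNA into one-letter amino acids."""
    dna = mrna.replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

antisense = "ATGCTTTTC"
print(template_strand(antisense), transcribe(antisense),
      translate(transcribe(antisense)))
```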

In the present invention, the words "grow", "growth",
"culture", and "amplification" mean increase in number, not
increase in size of individual cells. In the present
invention, the words "select" and "selection" are used in
the genetic sense: a biological process whereby a
phenotypic characteristic is used to enrich a population
for those organisms displaying the desired phenotype.



One selection is called a "selection step"; one pass
of variegation, followed by as many selection steps as are
needed to isolate a successful DBP, is called a
"variegation step". The amino acid sequence of one
successful DBP from one round becomes the parental
potential-DBP to the next variegation step. We perform
variegation steps iteratively until the desired affinity
and specificity of DNA-binding between a successful DBP and
chosen target DNA sequence are achieved.

In a "forward selection" step, we select for the
binding of the potential-DBP to a target DNA sequence; in a
"reverse selection" step, for failure to bind. The target
DNA sequence may be the final target sequence of interest,
or the immediate target may be a related sequence of DNA
(e.g., a "left symmetrized target" or "right symmetrized
target"). There is an important distinction between
screening and selection. Screening merely reveals which
cells express or contain the desired gene. Selection
allows desired cells to grow under conditions in which
there is little or no growth of undesired cells (and
preferably eliminates undesired cells).

The term "operon" is used to mean a collection of one
or more genes that are transcribed together. We will use
operon to refer also to one or more genes that are
transcribed together in eukaryotic cells independent of
post-transcriptional processing.

The term "binding marker gene" is used to mean those
genes engineered to detect sequence-specific DNA binding,
as by association of a target DNA with a structural gene
and expression control sequences. A single operon may
include more than one binding marker gene (e.g., galT,K).
A "control marker gene" is one whose expression is not
affected by the specific binding of a protein to the target
DNA sequence. The "control promoter" is the promoter
operably linked to the control marker gene.

Palindrome, palindromic, and palindromically are used
to refer to DNA sequences that are the same when read along
either strand, e.g.

                Palindromic DNA

                Rotational axis
                        ↓
        5' C T A G C C T A G G C T A G 3'
                        .
        3' G A T C G G A T C C G A T C 5'

The arrow indicates the center of the palindrome; if the
sequence is rotated 180° about the central dot, it appears
unchanged. In the present application, "palindromic" does
not apply to sequences that have mirror symmetry within one
strand, such as

                  Mirror Plane
                        |
        5' C T A G C C T|T C C G A T C 3'
        3' G A T C G G A|A G G C T A G 5'



DNA sequences can be partially palindromic about some
point (that can be either between two base pairs or at one
base pair), in which case some bases appear unchanged by a
180° rotation while other bases are changed.

A special case of partially palindromic sequence is a
"gapped palindrome" in which palindromically related bases
are separated by one or more bases that lack such symmetry:

                    Gapped Palindrome
       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
    5' C  T  A  G  C  T  T  T  C  C  G  G  C  T  A  G 3'
    3' G  A  T  C  G  A  A  A  G  G  C  C  G  A  T  C 5'

has CTAGC (bases 1-5) palindromically related to GCTAG
(bases 12-16) while the sequence TTTCCG (bases 6-11) in
the center has no symmetry.
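
The three kinds of symmetry defined above can be tested mechanically. In the sketch below (function names are illustrative), a sequence is palindromic in the sense of this application when it equals its own reverse complement; mirror symmetry within one strand is a different property; and the positions that survive a 180° rotation identify the arms of a gapped palindrome.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Sequence of the opposite strand, read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def is_palindromic(seq):
    """True if the duplex is unchanged by a 180-degree rotation."""
    return seq == reverse_complement(seq)

def has_mirror_symmetry(seq):
    """True if one strand reads the same forwards and backwards."""
    return seq == seq[::-1]

def rotation_invariant_positions(seq):
    """1-based positions whose base is unchanged by a 180-degree rotation."""
    rotated = reverse_complement(seq)
    return [i + 1 for i, (a, b) in enumerate(zip(seq, rotated)) if a == b]

print(is_palindromic("CTAGCCTAGGCTAG"))
print(has_mirror_symmetry("CTAGCCTTCCGATC"), is_palindromic("CTAGCCTTCCGATC"))
print(rotation_invariant_positions("CTAGCTTTCCGGCTAG"))
```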

For the purposes of this invention, a "non-deleterious
cloning site" is a region on a plasmid or phage that can be
cut with one restriction enzyme or with a combination of
restriction enzymes so that a large linear molecule
comprising all essential elements can be recovered.



Overview; Standard Methods 



Bacterial strains are cultured by standard methods
(DAVI80, MILL72, AUSU87). Constructions of vectors are by
standard methods (MANI82, ZOLL84, AUSU87). All genetic
constructions are confirmed, first by analysis with
restriction enzymes, and then by sequencing. Sequencing is
by the Sanger dideoxy method or by the Maxam-Gilbert
chemical method. Constructions that confer a phenotype are
tested for display of the desired phenotype. These
necessary controls are not described repeatedly.

Overview; The Selection System

The present invention separates mutated genes that
specify novel proteins with desirable sequence-specific
DNA-binding properties from closely related genes that
specify proteins with no or undesirable DNA-binding
properties, by: 1) arranging that the product of each
mutated gene be expressed in the cytoplasm of a cell
carrying a chosen DNA target subsequence, and 2) using
genetic selections incorporating this chosen DNA target
subsequence to enrich the population of cells for those
cells containing genes specifying proteins with improved
binding to the chosen target DNA sequence.

A selectably deleterious gene is positioned relative
to, usually downstream from, the target sequence so that
the gene is not expressed if a successful DNA-binding
protein specific to this target is expressed in the cell
and binds the target sequence. Alternatively, a selectable
beneficial gene may be arranged so that its transcription
is occluded by a strong promoter (ADHY82, ELLE89a,
ELLE89b). The target sequence is placed in or near the
occluding promoter so that successful binding by a protein
will repress the occluding promoter and allow transcription
of the beneficial gene. Elledge and coworkers disclose that
such systems work best in the bacterial chromosome or on
low-copy-number plasmids. The cell will survive exposure
to the selective conditions if transcription of the
selectably deleterious genetic element is blocked.

The preferred cell line or strain is easily cultured,
has a short doubling time, has a large collection of well
characterized selectable genes, includes variants that are
deficient in genetic recombination, and has a well
developed transformation system that can easily produce at
least 10 independent transformants/µg of DNA. Bacterial
cells are preferred over yeasts, fungi, plant, or animal
cells because they are superior on every count. Among
bacteria, E. coli is the premier candidate because of the
wealth of knowledge of genetics and cellular processes.
Other bacterial strains, such as S. typhimurium,
Pseudomonas aeruginosa, Klebsiella aerogenes, Bacillus
subtilis, or Streptomyces coelicolor could be used. DBPs
that bind to host regulatory sequences, such as promoters,
will be toxic. Thus, development of a DBP that specifically
binds to E. coli promoters is preferably done in a cell
line or strain, such as S. coelicolor, having significantly
different promoter sequences.



In the most preferred embodiments, all novel DBPs are
developed in E. coli recA- strains. The recA- genotype is
preferred over other rec- mutations because the recA-
mutation reduces the frequency of recombination more than
other known rec- mutations and has fewer undesirable side
effects. We choose a host strain that methylates or does
not methylate the target sequence in the desired way. For
example, a Dcm- strain is appropriate if the target
sequence contains CCWGG and we want a DBP that binds the
unmethylated form.

As vectors, phage, such as M13, have the advantage of
a high infectivity rate. Organisms or phage having a phase
in their life cycle in which the genome is single-stranded
DNA have a higher mutation rate than organisms or phage
that have no phase in which the genome is single-stranded
DNA. Plasmids are, however, preferred because genes on
plasmids are much more easily constructed and altered than
are genes in the bacterial chromosome and are more stable
than genes borne on phage, such as M13. M13-derived
vectors are nearly as preferred as plasmids.
In some embodiments, the cloning vector will carry: a)
the selectable genes for successful DBP isolation, b) the
pdbp gene, c) a plasmid origin of replication, and d) an
antibiotic resistance gene not present in the recipient
cell to allow selection for uptake of plasmid. Preferably
the operative vector is of minimum size.

Alternatively, the selectable binding marker genetic
elements are placed on a vector different from but
compatible with the vector that carries the pdbp gene. This
arrangement has the advantages that engineering the pdbp
gene is easier on a smaller plasmid and manipulation of
pdbp cannot introduce mutations into the selectable
binding marker genes.

Standard selections for plasmid uptake and maintenance
in E. coli include use of antibiotics (e.g., ampicillin
(Ap)) as shown in Table 2. Selection of cells with
antibiotics is preferred to nutritional selections, e.g.,
TrpA+, for several reasons. Nutritional selection may be
overcome by large volumes of cells or growth medium; host
chromosomal auxotrophy is rarely total; crossfeeding of the
non-growing cells by prototrophic recipients obscures the
outlines of the colonies; and late mutations to prototrophy
may arise on the plate due to spontaneous mutation of
nongrowing cells. Nonetheless, nutritional selection may
be employed.

Similarly, plasmids for use in B. subtilis are
engineered for selection of uptake and maintenance using
antibiotics. Plasmids used in streptomycete species bear
genes for resistance to antibiotics such as thiostrepton,
neomycin, and methylenomycin, in preference to auxotrophic
markers or sporulation and pigment screens such as spo in
bacilli and mel in streptomycetes.

Recombinant DNA manipulations in yeasts have been
achieved using complementation of auxotrophic markers, some
of which are shown in Table 3. High backgrounds are
surmounted by use of two unrelated binding marker genes
carried on the same vector, e.g., Leu2+ and Ura3+.
Selection for G418 resistance conferred by the bacterial
aphII gene expressed in yeast offers the advantages of
reduced background and a wider range of appropriate
recipient strains. The current upper range of efficiency
of DNA uptake into yeast cells indicates that this organism
is not now preferred for the process described in this
patent, although results could be achieved by large scale
practice.

The selection systems must be so structured that other
mechanisms for loss of gene expression are much less likely
than the desired result, repression at the target DNA
subsequence. Other mechanisms that could yield the desired
phenotype include: point mutations that inactivate the
deleterious gene or genes, deletion of the deleterious gene
or genes, host mutations that suppress the deleterious
genes, and repression at a site other than the target DNA
sequence.

A wide range of selectable phenotypes for E. coli and
S. typhimurium has been described (VINO87). Two broad
classes of selections are useful in this invention,
nutritional and chemical. Such selections are inherently
conditional in that they employ addition of a
growth-inhibitory chemical to the selective medium, or
manipulation of the nutrient components of the selective
medium. Further conditionality of the preferred method is
imposed by transcriptional regulation (e.g., by IPTG in
combination with the lacUV5 promoter and the LacIq
repressor) of the variegated pdbp gene. In those members of
the population that express DBPs that bind to the target,
IPTG indirectly controls the selectable genes; in these
cells, increased IPTG leads to reduced expression of the
selectable genes. Therefore the phenotypes for selection
are distinguished only in the presence of an inducing
chemical, and potential deleterious effects of these
phenotypes are avoided during storage and routine handling
of the strains.

Selection of mutant strains capable of producing
proteins that can bind to the target DNA subsequence is
enabled by engineering conditional lethal genes or
growth-inhibiting genes located downstream from the
promoter that contains the target DNA subsequence. In the
preferred embodiment, at least two independent conditional
lethal or inhibitory selections are performed
simultaneously. It is possible to use a single selection to
achieve the same purpose, but this is not preferred. Two
selections are strongly preferred since a simple mutation
in the selected gene, occurring at a frequency of 10^-6 to
10^-8 per cell, would occur in two selected genes
simultaneously at the product of the individual
frequencies, 10^-12 to 10^-16. Thus use of two selections
substantially reduces the probability of isolation of
artifactual revertant or suppressor strains.
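
The arithmetic behind the preference for two selections is a product of independent frequencies. A minimal sketch, taking single-gene escape frequencies of 10^-6 and 10^-8 per cell as the assumed bounds and assuming the two mutations arise independently (names are illustrative):

```python
from fractions import Fraction

def double_escape_frequency(freq_gene_1, freq_gene_2):
    """Frequency at which both selected genes are inactivated in one cell,
    assuming the two mutational events are independent."""
    return freq_gene_1 * freq_gene_2

single_gene = [Fraction(1, 10**6), Fraction(1, 10**8)]
two_genes = [double_escape_frequency(f, f) for f in single_gene]
for single, double in zip(single_gene, two_genes):
    print(f"single: {float(single):.0e}  double: {float(double):.0e}")
```

A mutation that affects both genes at once, such as a deletion spanning both or a host suppressor acting on both, is not independent and is not covered by this estimate.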

Selectable genes for which both forward and reverse
selections exist are preferred because, by changing host or
media, we can use these genes to select for binding by a
DBP to a target DNA sequence such that expression of one of
these genes is repressed, or we can select phenotypes
characteristic of cells in which there is no binding of the
DBP. For example, expression of the tet gene is essential
in the presence of tetracycline. On the other hand,
expression of the tet gene is lethal in the presence of
fusaric acid. Expression of the galT and galK genes in a
GalE- host in the presence of galactose is lethal (NIKA61).
Expression of galT and galK in a host that is GalE+ and
either GalT- or GalK- renders the cells Gal+ and allows
them to grow on galactose as sole carbon source.
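
The logic of a two-way selectable gene can be written out as a small truth table. The sketch below models only the tet example from the paragraph above, with the idealizing assumptions that binding of the DBP to the target fully represses tet and that each agent acts in an all-or-nothing manner; function names are illustrative.

```python
def cell_survives(tet_expressed, agent):
    """Idealized survival rule for a cell carrying a target-controlled tet gene."""
    if agent == "tetracycline":
        return tet_expressed        # tet expression is essential
    if agent == "fusaric acid":
        return not tet_expressed    # tet expression is lethal
    raise ValueError(f"unknown selective agent: {agent}")

def selection_outcome(dbp_binds_target, agent):
    """Survival given whether the DBP binds the target and represses tet."""
    tet_expressed = not dbp_binds_target
    return cell_survives(tet_expressed, agent)

for binds in (True, False):
    for agent in ("fusaric acid", "tetracycline"):
        print(f"DBP binds: {binds!s:5}  agent: {agent:12}  "
              f"survives: {selection_outcome(binds, agent)}")
```

Fusaric acid therefore serves as the forward selection (binders survive) and tetracycline as the reverse selection (non-binders survive).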



The term "source of a selective agent" includes the 
selective agent itself and any media components which cause 
the cell to manufacture the selective agent. 
The Detailed Examples describe selection of strains
with successful DBP binding to novel target subsequences
due to turn-off of two genes, each of which, if expressed,
confers sensitivity to a toxic substance. It is also
possible to use selection of strains in which successful
DBP binding to novel target operators turns off repressors
of genes encoding required gene products. For example,
using the binding marker gene P22 arc, we place an Arc
operator site so that binding of Arc represses expression
of a beneficial or conditionally essential gene, such as
amp. Another alternative is selection of expression of
required gene products due to successful binding of DBPs
derived from positive effectors, e.g. CAP from E. coli,
the repressor from phage λ, or the Cro67 (BUSH88) mutant
of λ Cro. Another alternative is to place the target
sequence in or near a strong promoter that occludes
transcription of a conditionally essential gene
(ELLE89a,b).

The selections described in the Detailed Examples
employ commercially available cloned genes on plasmids in
strains that can be obtained from the ATCC (Rockville, MD).
Alternatively, the genes can be produced synthetically from
published sequences or isolated from a suitable genomic or
cDNA library.

Numerous types of selections are possible for
selection of DBP expression in E. coli. The toxic and
inhibitory agents listed in Table 4 are used with
appropriately engineered host strains and vectors to select
loss of gene function listed above. Repression of
transcription of these genes allows growth in the presence
of the agents. Other outcomes such as deletions or point
mutations in these genes may also be selected with these
agents, hence two functionally unrelated selections are
used in combination. These agents share the property that
cell metabolism is stopped, and unlike the nutritional
selections, the inhibitory agents are not overcome by
components of the growth medium or turnover of
macromolecules in the cells. Selections using antibiotics,
metabolite analogs, or inhibitors are preferred. Another
class of selections includes those for repression of phage
or colicin receptors, or for repression of phage promoters.
These agents kill by single-hit kinetics, and in the case
of phage, are self-replicating, making the multiplicity of
agent to putative repressed cell much more difficult to
control, and so are not preferred (BENS86).

Any selection system relevant to the cell line or
strain may be substituted for those in the examples given
here, with appropriate changes in the engineering of the
cloning vectors. One example is the dominant pheS+ gene
carried on plasmid pHE3 (ATCC #37,161) in a pheS12
background. Turn-off of pheS+ is selected with
p-fluorophenylalanine (Sigma Corp., St. Louis, MO).

We could choose the Streptomyces coelicolor cloned
glucose kinase gene for selection of the DBP+ phenotype,
using the metabolite analog deoxyglucose.

Each batch of antibiotic is checked for MIC (minimum
inhibitory concentration) under the condition of use.
Increased concentration of antibiotic may be used to
increase the stringency of the selection, in most cases.

The user varies the medium formulation (pH, cation
concentrations, buffering agent, etc.) for a particular
selection if the results are not optimal with the strain at
hand. For example, Maloy and Nunn (MALO81) describe a
medium yielding improved selection of FusR E. coli colonies
from a TcR background, compared to the medium employed by
Bochner (BOCH80) for this purpose using S. typhimurium.

Stringency of selection can be modulated by
controlling copy number of plasmids bearing the selectable
genes; increasing copy number of selectable genes increases
the stringency of the selection.

During the initial phases of the progressive
development of DBP molecules, it is desirable to produce a
high intracellular concentration of DBP. The stringency of
the selection is increased in subsequent phases of
successful DBP development by allowing fewer molecules of
DBP per cell. Thus it is preferred to regulate
transcription of pdbp by an inducible or derepressible
promoter, such as PlacUV5.

High total cell input often decreases stringency of
selections, by providing metabolites that are specifically
omitted, by mass action with respect to an inhibitory
agent, or by generating a large number of artificial
satellite colonies that follow the appearance of
genetically resistant colonies. The number of cells that
are successfully transformed is a function of efficiency of
ligation and transformation processes, both of which are
optimized in the embodiment of this invention. Procedures
for maximal transformation and ligation efficiency are from
Hanahan (HANA85) and Legerski and Robberson (LEGE85)
respectively. Increasing stringency is imposed under the
conditions of high efficiency of these processes by
inoculation of plates with small volumes or dilutions of
cell samples. Pilot experiments are performed to
determine optimum dilution and volume.

In Detailed Example 1, the transformation event is
followed by dilution and growth of cells in permissive
medium. Exogenous inducer of DBP expression is included at
this step, and a set of selections is then imposed in liquid
medium. Surviving cells are concentrated by centrifugation
and selected for these and additional traits using solid
medium in Petri plates. This protocol offers the advantage
that fewer identical siblings are obtained and a larger
population is easily screened. In Detailed Example 1,
repression of the Gal^ phenotype is selected by exposing
transformants to galactose in liquid medium, which produces
visible lysis of galactose-sensitive cells. The second
selection employed in Detailed Example 1 is for the Fus^R
phenotype due to repression of Tc^R, which requires
limitation of total inoculum size to 10^ cells/plate.
Similar protocol variations are introduced to combine
selections for transformation and successful DBP function.

Tests of selective agents to determine the conditions
that kill or inhibit sensitive cells are performed with
pure cultures of sensitive cells. These include strains
carrying the selective marker genes having the recognition
sequence of the IDBP as target, with and without idbp, and
with and without the inducer of idbp expression.

Cultures of sensitive cells are applied to selective
media as inocula appropriate to the selection (usually 10^
to 10^ per plate). Sufficient numbers of replicates (10^7
to 10^ total sensitive cells for each medium) are tested by
each selection. The rate at which the cultures produce
revertants and phenotypic suppressors (considered together
as revertants) is determined. A rate greater than 10"^ per
cell indicates that stringency must be increased. If
reversion rates are below this level, as we have shown for
the selections described in Example 1, mixing experiments
are performed to determine the sensitivity of recovery of a
small fraction of resistant cells from a vast excess of
sensitive cells.

Normally, the deleterious gene product of a binding
marker gene is a protein. It may also be an RNA, e.g., an
mRNA which is antisense to the mRNA of an essential gene
and therefore blocks translation of the latter mRNA into
protein. Another alternative is that transcription of the
binding marker gene may be deleterious because this
transcription occludes transcription of an adjacent
beneficial gene. Selectively deleterious genes suitable
for use in the present invention include those shown in
Table 4.



The two selectably deleterious genes are preferably
not functionally related. For example, the chosen genes
should not code for proteins localized to or affecting the
same macromolecular assembly in the cell or which alter the
same or intersecting anabolic or catabolic pathways. Thus,
use of two inhibitors that both select for mutations
affecting RNA synthesis, or aromatic amino acid synthesis,
or each of histidine and purine synthesis, is not preferred.
Similarly, two inhibitors that are transported into the cell
by shared membrane components are functionally related, and
are not preferred. In this manner the user reduces the
frequency of isolation of single host mutations that yield
the apparent desired phenotype because of suppression of
the shared functionality, interacting component, or
precursor relationship. Host mutations of this type are
conveniently distinguished by a screen of the selectable
phenotypes in the absence of the inducer of the DBP, e.g.
IPTG.

Examples of pairs of deleterious genes which are
recommended for use in the present invention are given in
Table 5A. In each case, one of the paired genes codes for
a product that acts intracellularly while the other codes
for a product that acts either in transport into or out of
the cell or acts in an unrelated biological pathway. Table
5B gives some pairs that are not recommended. These pairs
have not been shown to malfunction, but they are not
recommended, given the large number of choices that are
clearly functionally unrelated.

A preferred novel feature is the use of a copy of the
promoter of one of these beneficial or conditionally
essential genes, operably linked to the target DNA
subsequence, to direct transcription of the selectably
deleterious or conditionally lethal binding marker genes of
the plasmid. If the potential-DBP should repress the
selectable gene by binding to this promoter, it would also
repress this beneficial activity.

In order to assure that selection for DBP binding is
specific to the target and not the promoter, we preferably
place one of the two selectable binding marker genes
under the same transcription initiation signal as the gene
we use for selection of vector maintenance. In Detailed
Example 1, transcription of the galT and galK genes is
initiated by the Pamp promoter, as is the amp gene.

It is possible that the potential-DBP will bind
specifically to the boundary between the target DNA
sequence and the promoter, or within the structural gene.
In the preferred embodiment, we discriminate against this
mechanism by choosing a different promoter, operably linked
to another copy of the same target DNA sequence, for the
second selectable gene. Preferably, the two promoters that
initiate transcription of the selectable genes should be
strong enough to give a sensitive selection, but not too
strong to be repressed by binding of a novel DBP. Some
well studied promoters and their scores by the Mulligan
algorithm (MULL84) are shown in Table 6. Promoters that
score between 50% and 70% are good candidates for use in
binding marker genes. Preferably, the two promoters have
significant sequence differences, particularly in the
region of the junction to the target DNA sequence.
Specifically, the region between the -10 region and the
target sequence, which comprises five to seven bases,
should have no more than two identical bases in the two
promoters. Although the -10 regions of promoters show high
homology, promoters are known (e.g. Pamp having GACAAT and
Pneo having TAAGGT) that have as few as two out of six
bases identical in this region, and such difference is
preferred.
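
The "no more than two identical bases" criterion can be checked mechanically. The following is a minimal sketch, not part of the disclosed method; the function names are hypothetical, and it assumes the two regions being compared have already been aligned and have equal length.

```python
def identical_positions(region_a: str, region_b: str) -> int:
    """Count aligned positions at which two DNA regions carry the same base."""
    if len(region_a) != len(region_b):
        raise ValueError("regions must be aligned and of equal length")
    return sum(a == b for a, b in zip(region_a.upper(), region_b.upper()))


def sufficiently_different(region_a: str, region_b: str, max_identical: int = 2) -> bool:
    """True if the two regions share at most `max_identical` identical positions."""
    return identical_positions(region_a, region_b) <= max_identical


# The -10 regions quoted in the text for Pamp and Pneo.
print(identical_positions("GACAAT", "TAAGGT"))     # 2
print(sufficiently_different("GACAAT", "TAAGGT"))  # True
```

The same comparison can be applied to the five-to-seven-base junction regions once their sequences are known.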

The target DNA sequence for the potential DNA-binding
protein must be associated with the two deleterious binding
marker genes and their promoters so that expression of the
binding marker genes is blocked if a novel protein in fact
binds to the target sequence. The target DNA sequence
could appear upstream of the gene, downstream of the gene,
or, in certain hosts, in a noncoding region (viz., an
"intron") within the gene. Preferably, it is placed
upstream of the coding region of the gene, that is, in or
near the RNA polymerase binding site for the gene, i.e. the
promoter. If the binding marker gene is an occluding
promoter, the target is, preferably, placed downstream of
the promoter. Placement of the target DNA sequence
relative to the promoter is influenced by two main
considerations: a) protein binding should have a strong
effect on transcription so that the selection is sensitive,
and b) the activity of the promoter in the absence of a
binding protein should be relatively unaffected by the
presence of the test DNA sequence compared to any other
target subsequence.

In the present invention, we will deal primarily with
DNA target subsequences of 10 to 25 bases. It has been
noted that the highly conserved -35 region and the highly
conserved -10 region are separated by between 15 and 21
base pairs with a mode of 17 base pairs (HAWL83, MULL84).
Some of the bases between -35 and -10 are statistically
non-random; thus placement of target DNA sequences longer
than 10 bases between the -10 and -35 regions would likely
affect the promoter activity independent of binding by
potential-DBPs. Because quantitative relationships between
promoter sequence and promoter strength are not well
understood, it is preferable, at present, to use known
promoters and to position the target at the edge of the RNA
polymerase binding site.

Protein binding to DNA has maximum effect on
transcription if the binding site is in or just downstream
from the promoter of a gene. Hoopes and McClure (HOOP87)
have reviewed the regulation of transcription initiation
and report that the LexA binding site can produce effective
repression in a variety of locations in the promoter
region. In a preferred embodiment, we place target DNA
sequences that begin with A or G so that the first 5' base
of the target sequence is the +1 base of the mRNA, as the
LexA binding site is located in the uvrD gene (HOOP87,
p. 1235). If the target sequence begins with C or T, we
preferably place the target so that the first 5' base of
the target is the +2 base of the mRNA and we place an A or
G at the +1 position. An alternative is to place the
target DNA sequences upstream of the -35 region, as the LexA
binding site is located in the ssb gene (HOOP87, p. 1235).
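
The +1/+2 placement rule above can be expressed as a short function. This is an illustrative sketch only; the function name and the default choice of A (rather than G) as the inserted +1 purine are assumptions, since the text permits either.

```python
def place_target_at_start(target: str, plus_one_purine: str = "A") -> tuple:
    """Return (sequence beginning at the +1 base, mRNA position of the first target base).

    If the target begins with a purine (A or G), the target itself starts at +1.
    Otherwise a purine is placed at +1 and the target starts at +2.
    """
    target = target.upper()
    if not target:
        raise ValueError("target must not be empty")
    if target[0] in "AG":
        return target, 1
    if plus_one_purine not in ("A", "G"):
        raise ValueError("the +1 base must be A or G")
    return plus_one_purine + target, 2


print(place_target_at_start("ACTTTCCGCTGGAAAGT"))  # ('ACTTTCCGCTGGAAAGT', 1)
print(place_target_at_start("CTGCTGG"))            # ('ACTGCTGG', 2)
```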


It may be useful in early stages of the development of
a DBP to have more than one copy of the target DNA sequence
positioned so that binding of a DBP reduces transcription
of the selectable gene. Multiple copies of the target DNA
sequence enhance the sensitivity of phenotypic
characteristics to binding of DBPs to the target DNA sequence.
Multiple copies of the target DNA sequence are, preferably,
placed in tandem downstream of the promoter. Alternatively,
one could place one copy upstream of the promoter and
one or more copies downstream.

We arrange the genes on the plasmid or plasmids in
such a way that no single deletion event eliminates both
deleterious genes without also eliminating a gene essential
either to plasmid replication or cell survival. Thus,
resistant colonies are unlikely to arise through deletions
because two independent deletion events are required.
Similarly, simultaneous occurrence of one point mutation
and one deletion is as unlikely as two point mutations or
two deletions.



A typical arrangement of genes on the operative
cloning vector, similar to that used in Detailed Example 1,
is:

   /-- PlacUV5-o-pdbp-t ---- Px-$-galT-galK-t --\
   |                                            |
   \-- Px-amp-t ---- Py-$-tet-t ---- ori -------/

Px represents the promoter that initiates transcription of
the amp gene. A second copy of Px initiates transcription
of galT,K. Py is a promoter driving tet. t is a
transcriptional terminator (different terminators may be used
for different genes), and $ is the target subsequence. PlacUV5
is the lacUV5 promoter, o represents the lacO operator, and
pdbp is a variegated gene encoding potential DBPs.
Placement of the pdbp gene relative to other genes is not
important because mutations or deletions in pdbp cannot
cause false positive colony isolates. Indeed, it is not
necessary that the pdbp gene be on the selection vector at
all. The purpose of the selection vector is to ensure that
the host cell survives only if one of the PDBPs binds
to the target sequence (forward selection) or fails to so
bind (reverse selection). The pdbp gene may be introduced
into the host cell by another vector.

Two-way selections are available for both tet and
galT,K (vide supra). The orientation of each gene in the
selection vector is unimportant because strong terminators
(e.g. rrnBt1, rrnBt2, phage fd terminator) are preferably
placed at the ends of each transcription unit. That galT,K
and tet are separated by essential genes, however, is of
fundamental importance. The sequence ori is essential for
plasmid replication, and the amp gene, the transcription of
which is initiated by Px, is essential in the presence of
Ap. Successful repression of galT,K and tet is selected
with galactose and fusaric acid. No single deletion event
can remove both the latter genes and allow plasmid
maintenance or cell survival under selection. In addition,
binding by a novel DBP to the Px promoter would render the
cell Ap sensitive. These arrangements make appearance of a
novel DBP that binds the target DNA more probable than any
of the other modes by which the cells can escape the
designed selections.

Overview: Choice of target DNA binding sequence for
development of successful novel DBPs:

Our goal is the development, in part by conscious
design and in part by in vivo selection, of a protein
which binds to a DNA sequence of significance, e.g., a
structural gene or a regulatory element, and through such
binding inhibits or enhances its biological activity. In
the preferred embodiment, the protein represses
transcription of a deleterious element, such as a viral gene. A
sufficiently long sequence could be the target of several
independently acting DBPs.

Another goal of this invention is to derive one or
more DBPs that bind sequence-specifically to any
predetermined target DNA subsequence. It is not yet possible to
design the DBP-domain amino-acid sequence from a set of
rules appropriate to the target DNA subsequence. Rather,
it is possible to pick sets of residues that can affect the
DNA recognition of a parental DBP. Then, variegation of
residues that affect DNA recognition coupled with selection
for binding to the target DNA subsequence can produce a
novel DBP specific for the target DNA subsequence. Such a
method is limited by the number of amino acids that can be
varied at one time. To develop a novel DBP that recognizes
15 bases could require changing 15 or more residues in the
initial DBP. Variegation of 15 residues through all 20
amino acids would produce 20^15 = 3.3 x 10^19 sequences and
is beyond current technology. Thus we start with the
recognition sequence of the initial DBP, change two to five
bases and select, in one or more rounds of variegation and
selection, a novel DBP that recognizes this new target DNA
subsequence. This new DBP becomes the parent to the next
step in which the target DNA subsequence is changed by an
additional two to five bases so that a stepwise series of
changes in binding protein and changes in target is used.
It is emphasized here that, although we initially select
DBPs that recognize sequences similar to that recognized by
the IDBP, the ultimate target sequence recognized by the
desired final DBP can be completely unrelated to the
recognition sequence of the IDBP.
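
The size of a fully variegated library grows as a power of the number of residues varied, which is why the stepwise approach is used. The following sketch simply evaluates that power; the function name is hypothetical, and it assumes every varied residue is allowed all 20 amino acids.

```python
def library_size(varied_residues: int, amino_acids_per_residue: int = 20) -> int:
    """Number of distinct protein sequences when each varied residue
    may take any of `amino_acids_per_residue` amino acids."""
    return amino_acids_per_residue ** varied_residues


# Varying 15 residues at once, as discussed in the text.
print(f"{library_size(15):.1e}")  # 3.3e+19

# Varying only a few residues per round keeps the library small.
for n in (2, 3, 4, 5):
    print(n, library_size(n))
```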

The process of finding a DBP that recognizes a
sequence within a genome is shortened if we pick sequences
that have some similarity to the cognate sequence of the
initial DBP. The intent is to locate several unique sites
in the gene which can be bound specifically by DBPs such
that transcription through those sites is reduced.

The sequences of some regions of genes of eukaryotic
pathogens vary among strains (SAAG88). To optimize the
search for target sites in the gene selected for repression
such that repression will be effective in all or the
majority of strains of a pathogen, regions of conserved DNA
sequence within the gene are, preferably, identified.

There may be a very small number of sequences that
occur in the genome of the host cells for which binding of
a DBP will be lethal. For this reason, the regulatory
sequences, such as promoters, of the host organism are not
preferred targets for DBP development. Preferably, the
target sequence occurs only in the gene of interest. For
some applications, target sequences that occur at locations
other than the site of intended action may be used if
binding of a protein to the extra sites is acceptable.

Preliminary elimination of non-unique sequences is
done by searching DNA sequence data banks of host genomic
sequences and bacterial strain sequences, and by searching
the plasmid sequences for matches to the potential target
subsequences. Remaining potential target subsequences are
then used as oligonucleotide probes in Southern analyses of
host genomic DNA and bacterial DNA. Sequences which do not
anneal to host or bacterial DNA under stringent conditions
are retained as target subsequences. These target
subsequences are cloned into the operative vector at the
promoters of the selection genes for DBP function, as
described for the test DNA binding sequence.
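
The data-bank step of this elimination amounts to discarding any candidate that also occurs in a background sequence. The sketch below is one possible way to do that and is not taken from the source; the function names are hypothetical, it tests exact matches only, and it checks both strands by also searching for the reverse complement. It does not replace the Southern analysis described above.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def occurs_in(candidate: str, background: str) -> bool:
    """True if the candidate occurs exactly on either strand of the background."""
    candidate, background = candidate.upper(), background.upper()
    return candidate in background or reverse_complement(candidate) in background


def unique_candidates(candidates, backgrounds):
    """Keep candidates absent from every background sequence
    (host genome, bacterial strain, plasmid)."""
    return [c for c in candidates
            if not any(occurs_in(c, bg) for bg in backgrounds)]


# Toy data for illustration only; real use would load data-bank sequences.
backgrounds = ["GGGACTTTCCGCTGGGGACTAAA"]
candidates = ["ACTTTCCGCTGGGGACT", "TTTTTTTTTTTTTTTTT"]
print(unique_candidates(candidates, backgrounds))  # ['TTTTTTTTTTTTTTTTT']
```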

Choice of target subsequences is based also on the
optimal location of target sites within a gene such that
transcription will be maximally affected. Studies of
monkey L-cells show that lac repressor can bind to lac
operator, or to two lac operators in tandem, in the L-cell
nucleus (HUMC87, HUMC88). Further, this binding results in
repression of a downstream chloramphenicol acetyl
transferase (CAT) gene in this system, and repression is
relieved by IPTG. Two tandem operators repress CAT enzyme
production to a greater extent than a single operator. The user
preferably locates two to four target sites relatively
close to each other within the transcriptional unit.

Overview: Strategies for Obtaining Protein Recognition of
Non-Symmetric Target DNA Sequences

In vitro, lac repressor binds to a perfectly palindromic
synthetic lac operator which omits the central base
pair of the natural operator 10 times more tightly than it
does to the wild-type operator (SADL83). In vivo, the
synthetic operator represses beta-galactosidase activity to
a 4-fold lower level than does the wild-type operator.
Simons et al. (SIMO84) describe the isolation of five lac
operator-like subsequences from eukaryotic DNA that titrate
lac repressor in vivo. All five subsequences share a 14 bp
consensus subsequence that lacks the central base pair of
the natural lac operator and is a perfect palindrome of the
left seven base pairs of the natural lac operator. A
synthetic 11-base pair inverted repeat of the left half of
the E. coli lac operator binds lac repressor 8-fold more
tightly than does the natural operator. We conclude that
natural repressors have not evolved to have maximal
affinity for their operators; rather, they have evolved to
produce optimal regulation.

Symmetrized operator subsequences for E. coli trp
repressor (BASS87) and lambda repressor (BENS88) bind their
respective repressors more tightly than do the natural
operators. For lambda repressor, unlike lac repressor, the
optimal binding subsequence both includes a base pair at
the center of symmetry and contains a non-consensus base
pair (BENS88).

It is important to note that the focus of all of the
above experiments has been on symmetry: symmetric
operators, symmetric changes in protein binding residues, etc.
In the natural systems discussed above, increasing operator
subsequence symmetry towards the consensus palindrome does
indeed increase the strengths of the binding interactions.
This result arises, however, not from symmetry per se, but
from optimizations of the protein-DNA interactions at both
operator half-sites. If the DNA-binding protein presents a
different binding domain to the operator at each half-site,
symmetric DNA operator subsequences are not only not
optimal but are unfavorable. The implications of this
distinction have not been considered in the literature.

Starting from natural, dyad symmetric or de novo
designed DBPs we can generate specific DBPs with
non-symmetric target recognition using a variety of strategies.
Seven examples of strategies are listed; however, this
invention is not limited to these particular strategies.

1) Produce two dimeric DBPs. One DBP is produced by the
   means described here to recognize a symmetrized
   version of the left half of the target and is called
   DBP_L. The other DBP is similarly produced to
   recognize a symmetrized version of the right half of the
   target and is called DBP_R. Cells producing equimolar
   mixtures of DBP_R and DBP_L contain approximately 1 part
   DBP_L dimer, 2 parts DBP_L:DBP_R heterodimer, and 1 part
   DBP_R dimer. Thus one half of the DBP molecules bind
   to the non-symmetric target subsequence. These
   heterodimers may be isolated by affinity separation
   techniques, or the 50% active mixture may be used
   directly.

2) Produce a mixture of DBP_R and DBP_L as described in (1)
   and crosslink the proteins with an agent such as
   glutaraldehyde. Use a column that contains the DNA target
   subsequence to purify DBP_L:DBP_R heterodimer from the
   homodimers.

3) Produce (by variegation of the dimerization interface
   of a known DBP, as described more fully hereafter) a
   heterodimer comprised of complementing mutant
   sequences DBP1 and DBP2 such that the heterodimer DBP1:DBP2
   is exclusively formed. Next, alter the recognition
   domains of DBP1 and DBP2 by the methods described here
   to produce heterodimers having asymmetric recognition,
   e.g. DBP1_L:DBP2_R.

4) Produce a heterodimer DBP1:DBP2 as in (3) and
   crosslink the proteins in vitro with an agent such as
   glutaraldehyde as in (2).

5) Produce two dimeric DBPs with left and right target
   recognition elements as in (1); produce complementing
   heterodimer mutations as in (3) such that the
   non-symmetric recognition heterodimer DBP1_L:DBP2_R is
   constructed.

6) Produce a pseudo-dimer composed of a single
   polypeptide chain such that recognition elements that
   contact different bases are encoded by different
   codons; each DNA-contacting residue and every domain
   is independently variable and so asymmetric
   recognition can be established.

7) Produce DBP_L and DBP_R in separate steps, where a
   heterodimer DBP:DBP_R is developed to recognize a
   hybrid target consisting of the wild type left
   half-site fused to the right half of the target, and
   DBP_L:DBP is developed to recognize a hybrid target
   consisting of the wild type right half-site fused to
   the left half of the target. Once produced, DBP_L and
   DBP_R are co-expressed intracellularly as described in
   (1) above, crosslinked as described in (2) above, or
   are modified to produce the obligately complementing
   non-symmetric recognition heterodimer DBP1_L:DBP2_R as
   described in (5) above.

Detailed Example 1 employs strategy 5; Detailed
Example 2 employs strategy 6. Section 6 of Detailed
Example 1 also describes strategy 3.

For each target DNA sequence chosen, a left arm TL, a
center core TC, and a right arm TR are defined. Two
symmetrized derivatives of this target subsequence, the
left symmetrized target (TL->)-(TC)-(TL<-) and the right
symmetrized target (TR<-)-(TC)-(TR->), are designed and
synthesized.

We divide the target DNA sequence into TL, TC, and TR
based on knowledge of the interaction of the parental DBP
with DNA sequences to which it binds, i.e. the operator.
This knowledge may come from X-ray structures of parental
DBP-operator complexes, models based on 3D structures of
the DBP, genetics, or chemical modification of parental
DBP-operator complexes.



Our strategy is to pick a target by finding a sequence
that contains a close approximation to the central core of
the operator. Bases in the center of the target may not be
contacted directly by the DBP but affect the specificity of
binding by influencing the position or flexibility of the
bases that are contacted directly by the DBP.
Accommodating changes (operator vs. target) in uncontacted bases
may require subtle changes in the tertiary or quaternary
structure of the DBP, such as might be effected by
alterations in the dimerization interface of a dimeric DBP. We
can accommodate most changes in bases directly contacted by
the DBP by altering the residues that contact those bases.
Therefore, it is easier to accommodate changes in those
bases that are directly contacted by the DBP and we
endeavor to avoid changes in the central core by seeking a
target the central core of which is highly similar to the
central core of the operator of the parental DBP.

We must balance two tendencies: a) if we assign too
many bases to TC, we are unlikely to find a close
approximation of TC in the genome of interest; and b) if we
assign too few bases to TC, we may thereby assign
uncontacted bases to the arms. Differences between the target
and the DNA sequence that binds the initial DBP at
uncontacted bases in the arms may be difficult to accommodate
through variegation of residues that contact the DNA
directly; such a situation could cause variegation and
selection to yield a functional DBP very slowly.
Preferably, the length of TC is at least 6 but not greater than
10.



We search the target genome, first with the entire
operator binding sequence, and then with progressively
shorter central fragments of the operator, until an
acceptable match is found. A match is acceptable if all or
almost all the bases (e.g. six out of seven) match and
other criteria are met.
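
The search just described can be sketched as a scan with a mismatch allowance, repeated with shorter and shorter central fragments. This is an illustrative sketch under stated assumptions, not the procedure actually used: the function names are hypothetical, only one strand is scanned, fragments are shortened two bases at a time so that they stay centered, and "other criteria" (such as sequence conservation) are not modeled.

```python
def find_matches(genome: str, probe: str, max_mismatches: int = 1):
    """Return (position, matched window, mismatch count) for every window of
    the genome that differs from the probe at no more than `max_mismatches` bases."""
    genome, probe = genome.upper(), probe.upper()
    hits = []
    for i in range(len(genome) - len(probe) + 1):
        window = genome[i:i + len(probe)]
        mismatches = sum(a != b for a, b in zip(window, probe))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


def central_fragment(operator: str, length: int) -> str:
    """Return the centered fragment of the given length."""
    start = (len(operator) - length) // 2
    return operator[start:start + length]


def search_progressively(genome: str, operator: str,
                         min_core: int = 7, max_mismatches: int = 1):
    """Try the whole operator first, then shorter central fragments,
    and return the first probe that gives at least one acceptable match."""
    for length in range(len(operator), min_core - 1, -2):
        probe = central_fragment(operator, length)
        hits = find_matches(genome, probe, max_mismatches)
        if hits:
            return probe, hits
    return None, []


# The 7 bp consensus core quoted in the text, scanned against the
# 17-base HIV-1 segment quoted in the text (0-based positions).
print(find_matches("ACTTTCCGCTGGGGACT", "CCGCGGG"))  # [(5, 'CCGCTGG', 1)]
```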



Consider matching 6 of 7 bases as the criterion for
choosing a target. The original sequence is acceptable, as
are the 21 (= 7 x 3) sequences that differ by one base.
There are 4^7 = 2^14 = 16384 possible heptamers. Thus we
should expect to find an acceptable match every 16384/22 =
745 bases. Similarly, matching 7 of 8 bases should occur
every 65536/25 = 2621 bases; matching 8 of 9 bases should
occur every 262144/28 = 9362 bases. These expected
frequencies are such that viruses, which have genome sizes
ranging from 5 x 10^ bases up to 10^ bases or more, should
have one or more matches of 6 of 7 bases. Larger viruses
should contain matches of 7 of 8 or even 8 of 9 bases.
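
The arithmetic above generalizes to any probe length. The sketch below reproduces it under the same simplifying assumptions as the text (a random sequence with equal base frequencies, one strand considered); the function names are hypothetical.

```python
from math import comb


def acceptable_sequences(length: int, max_mismatches: int = 1) -> int:
    """Number of sequences of the given length that differ from a
    reference sequence at no more than `max_mismatches` positions."""
    return sum(comb(length, k) * 3 ** k for k in range(max_mismatches + 1))


def expected_spacing(length: int, max_mismatches: int = 1) -> float:
    """Expected number of bases between acceptable matches in random DNA."""
    return 4 ** length / acceptable_sequences(length, max_mismatches)


for n in (7, 8, 9):
    print(n, 4 ** n, acceptable_sequences(n), round(expected_spacing(n)))
# 7 16384 22 745
# 8 65536 25 2621
# 9 262144 28 9362
```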

Other criteria may include restricting the search to 
parts of the genome not known to vary among different 
isolates of the organism. 


If the longest matching search sequence is such that
bases known to have no direct contact with the DBP are
assigned to the arms, then we increase the size of TC to at
least seven and then use a progression of core sequences to
move in a stepwise fashion from a sequence that closely
resembles the operator of the parental DBP to that of the
target. We obtain an acceptable DBP for each target by
variegation and selection. The best DBP from one target
becomes the parental DBP for the next target in the
progression. Accommodating changes in uncontacted bases in
the central core may require variegation of residues in the
protein-protein interface to produce subtle changes in the
tertiary and quaternary structure of the DBP.

To illustrate this process, we consider the target
chosen for Detailed Example 1. The HIV 353-369 target
subsequence ACTTTCCGCTGGGGACT is nucleotides 353 to 369 of
the HIV-1 genome (RATN85), chosen because of the close
match of the central 7 bp of the Kim Consensus sequence
(CCGCGGG) of lambda Cro (KIMJ87) to the underscored bases
(CCGCTGG). No non-variable HIV sequence matched the nine
central bases of any OR3-like sequence. Highly conserved
bases of lambda OR3 are written bold with stars above.



123456789 
* * * A * * 

OR3    5' TATCACCGCAAGGGATA 3'
       3' ATAGTGGCGTTCCCTAT 5'

HIV-1  5' ACTTTCCGCTGGGGACT 3'
       3' TGAAAGGCGACCCCTGA 5'
          |
          353

TL, TC, and TR are defined, in this case, as the first 5
bases, the center 7 bases (underlined), and the last 5
bases, respectively, of the target subsequence. TL->
differs from the corresponding bases of OR3 at four of
five bases, including the strongly conserved A2 and C4.
TL<- is complementary to TL->:

    TL-> = 5' ACTTT 3'
    TL<- = 3' TGAAA 5'

We create the symmetrized target by rotating TL<- about
the center of the 17 bp sequence into the same strand as
TL->:

    TL-> = 5' ACTTT ....... AAAGT 3' = TL<-
    TL<- = 3' TGAAA               5'

TR<- differs from the corresponding bases of OR3 at three
of five positions, including the highly conserved A2. We
rotate TR<- into the same strand as TR->:

    TR<- = 5' AGTCC ....... GGACT 3' = TR->
           3'               CCTGA 5' = TR<-

Symmetrized operators derived from HIV 353-369 are:

Left Symmetrized Target:
   5' A C T T T C C G C T G G A A A G T 3'
   3' T G A A A G G C G A C C T T T C A 5'
     |  TL->   |     TC      |  TL<-   |

and

Right Symmetrized Target:
   5' A G T C C C C G C T G G G G A C T 3'
   3' T C A G G G G C G A C C C C T G A 5'
     |  TR<-   |     TC      |  TR->   |

The two symmetrized derivatives are engineered into the
appropriate vectors so that each of these sequences
regulates the expression of the designed selectable genes
of each of the respective vectors.
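
The construction of the two symmetrized targets can be written compactly: each keeps the central core and one arm, and replaces the other arm with the reverse complement of the arm that is kept. The following sketch assumes arms of equal length and uses hypothetical function names; it is offered as an illustration of the construction shown above, not as part of the disclosed method.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def symmetrized_targets(target: str, arm_length: int):
    """Split a target into left arm, core, and right arm, and return
    (left symmetrized target, right symmetrized target)."""
    target = target.upper()
    if not 0 < arm_length < len(target) / 2:
        raise ValueError("arms must leave a non-empty central core")
    left = target[:arm_length]
    core = target[arm_length:-arm_length]
    right = target[-arm_length:]
    left_symmetrized = left + core + reverse_complement(left)
    right_symmetrized = reverse_complement(right) + core + right
    return left_symmetrized, right_symmetrized


# HIV 353-369 with 5-base arms and a 7-base core, as in the text.
lst, rst = symmetrized_targets("ACTTTCCGCTGGGGACT", 5)
print(lst)  # ACTTTCCGCTGGAAAGT
print(rst)  # AGTCCCCGCTGGGGACT
```

Note that only the arms become symmetric; the core is carried over unchanged and need not itself be palindromic.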



Had the best match in HIV to the CCGCGGG core been,
for example, CTGCTGG, then we would use the symmetrized
targets shown above until we found acceptable DBPs for the
right and left targets. At that point, we would change the
symmetrized targets to:

Second Left Symmetrized Target:
                  v
   5' A C T T T C T G C T G G A A A G T 3'
   3' T G A A A G A C G A C C T T T C A 5'

and

Second Right Symmetrized Target:
                  v
   5' A G T C C C T G C T G G G G A C T 3'
   3' T C A G G G A C G A C C C C T G A 5'
     |  TR<-   |     TC      |  TR->   |

Using these targets and the selected right and left DBPs as
new parental DBPs, we would initiate a new round of
variegation and selection.

As another example, consider the 14 bp 434 operator.
We could take each arm as 4 bp and the central core as 6
bp. We are likely to find good matches to the 6 bp core in
any genome larger than 4096 bases. Thus, this division of
the operator is preferred over that which assigns 5 bp to
each arm and 4 bp to the core.



In order to obtain proteins that bind to these
symmetrized targets, we generate a population of potential
dbp genes by synthesizing DNA that codes on expression for
part or all of a potential DBP and having variegated bases
in the codons that encode residues of the parental DBP that
are thought to contact the DNA or that influence the
detailed position or dynamics of residues that contact the
DNA. The variegation in the chosen codons, embodied in
the synthetic DNA, is transferred to the pdbp gene either by
replacement of a cassette or by annealing a mutagenic
oligonucleotide to ssDNA.

The pdbp gene may be part of the vector that carries
the selectable binding marker genes or may be separate.
Two sets of selectable binding marker genes are prepared,
one carrying the Right Symmetrized Targets (RST) and one
carrying the Left Symmetrized Targets (LST). If the pdbp
gene is on a different vector from the selectable binding
marker genes, then RST and LST selection strains are
prepared. A highly variegated population of pdbp genes is
delivered into cells that also contain one of: a) the
RST-containing selectable genes, or b) the LST-containing
selectable genes.

The two sets of transformed cells are selected for
vector uptake and successful repression at low stringency
of selection. In the case described in Detailed Example 1,
cells containing DBPs will be Tc^S, Fus^R, and Gal^R.

After one or more variegation steps, DBPs that bind
tightly and specifically to each of the Left Symmetrized
and Right Symmetrized Targets are obtained. These DBPs are
designated, in general terms, DBPL and DBPR, respectively.
If these proteins are produced in equal amounts in the same
cell, then approximately 50% of DBP protein dimers consist
of the DBPL:DBPR heterodimer. This may be sufficient for
repression of the target. In the preferred embodiment,
further mutations are introduced into the DBPL and DBPR
proteins, as described below, to enable 100% of the
molecules to form heterodimers.
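
The figure of approximately 50% follows from random pairing of monomers. The sketch below makes that reasoning explicit. It assumes that the two monomer types associate with equal affinity and without preference, an idealization that the text does not state; the function name is hypothetical.

```python
# Sketch: expected dimer composition when two monomer types pair at
# random. Assumes equal association affinity for every pairing.
def dimer_fractions(frac_left: float) -> dict[str, float]:
    """Return fractions of L:L, L:R and R:R dimers.

    frac_left is the fraction of monomers that are of the "left" type.
    """
    frac_right = 1.0 - frac_left
    return {
        "LL": frac_left ** 2,
        "LR": 2.0 * frac_left * frac_right,  # two ways to form a heterodimer
        "RR": frac_right ** 2,
    }


print(dimer_fractions(0.5))
```

Under this model the heterodimer fraction is largest when the two proteins are present in equal amounts and falls off as the ratio becomes unequal.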


In an especially preferred embodiment, variegation of
the gene to alter its DNA-specificity is combined with
variegation of the gene to alter its dimerization (protein-
protein binding) characteristics, so that the formation of
the heterodimer DBPL:DBPR is favored. The variegation of
the dimerization interface may precede (strategy 3) or
follow (strategy 5) the alteration of the DNA specificity.
Simultaneous variegation at both sites is also possible.

The DNA-binding proteins considered here interact
with specific DNA sequences as multimers (usually dimers or
tetramers) (PABO84). Monomers usually associate
independently and the resulting multimer interacts with DNA.
Coupling between oligomerization and DNA-binding equilibria
results in explicit inclusion of oligomerization effects in
the apparent affinity of DNA-binding proteins for their
operators (JOHN80, RIGG70, and CHAD71).

The precise geometry of the protein in the complex
with DNA strongly influences the strength of the
interaction with DNA. For example, Sauer et al. generated a 92
amino-acid fragment of λ repressor carrying the YC88
mutation. This N-terminal domain dimerizes through a
covalent S-S bond. Although the dissociation into monomers
is unmeasurable, the binding to DNA is diminished about
10-fold relative to intact λ repressor (SAUE86).



The results presented by Reidhaar-Olson and Sauer
(REID88), summarized in Table 7, show which residues in
the dimerization region, when varied, will produce
functional homodimers of N-terminal domains with little
alteration of structure. Wide variation is tolerated at
solvent-exposed positions 85, 86, and 89. In contrast,
almost no substitutions are tolerated at the buried
positions 84 and 87. Most hydrophobic residues are
functional at position 91 (except P; we use the single-
letter code for amino acids, AUSU87, Appendix A), although
aromatic residues are excluded. The hydrophobic
interactions among I84, M87' and V91' had previously been shown to
be major components of dimerization free energy (NELS83,
WEIS87b). In general, mutations that destabilize λ
repressor N-terminal dimerization are similar to those that
destabilize global protein structure.

The P22 Mnt repressor, like λ Cro, is a small
protein containing both DNA-binding and oligomerization
sites. Unlike Cro, P22 Mnt is a tetramer in solution
(VERS85b, VERS87a). The amino acid sequence of Mnt has
been determined (VERS87a) but the three dimensional
structure of the protein is not known. Knight and Sauer
(KNIG88) have shown, by sequential deletion of C-terminal
residues, that Y78 is essential for tetramer formation.

A preferred embodiment of this process utilizes
information available on protein structure obtained from
crystallographic, modeling, and genetic sources to predict
the residues at which mutation results in stable protein
monomers that retain substantially the same 3D structure as
the wild-type DBP, but that fail to form dimers.
Dimerization mutants are constructed using site-directed
mutagenesis to isolate one or more user-specified substitutions at
chosen residues. The process starts using one of the genes
selected for binding to a symmetrized target, denoted dbp1
(dbp1 could be either the dbpL gene or the dbpR gene), as
the parental sequence, so that each of several specific
mutations is engineered into the gene for a protein
binding specifically to the symmetrized target used in the
selection (the Left Symmetrized Target in the case of the
dbpL gene).

Reverse selection isolates cells not expressing a
protein that binds to the target DNA sequence. This
phenotype could arise in several ways, including: a) a
mutation or deletion in the gene so that no protein is
produced, b) a mutation that renders the descendant of the
parental DBP1 unstable, or c) a mutation that allows the
descendant of the parental DBP1 to persist and to fold into
nearly the same 3D structure as the parental DBP, but which
prevents oligomerization. It is anticipated that reverse
selection will isolate many genes for non-functional
proteins and that these proteins must be analyzed until a
suitable oligomerization mutant is found. Therefore we
choose sites carefully so that we maximize the chance of
disrupting oligomerization without destroying tertiary
structure. We also use lower levels of variegation in
reverse selection so that the number of mutants to be
analyzed is not too large. For forward selection, the
number of different mutants is preferably 10^4 to 10^ and
more preferably greater than 10^. For reverse selection
it is 10^3 to 10^5. (Under certain circumstances, the number
of reverse selection mutants could be as low as 10-20.)

Cassettes bearing the site-specific changes are
synthesized and each is ligated into the vector at the
appropriate site in the dbp1 gene. Transformants are
obtained by the antibiotic-resistance selection for vector
maintenance (e.g. Ap^R), and screened for loss of repression
of the selective systems under control of DBP1 binding.
Defective dimerization results in substantially decreased
DNA affinity, hence the altered derivatives are recognized
by screening isolates obtained using the selectable gene
systems. In Detailed Example 1 (where dbp1 is dbpL),
dimerization-defective derivatives are Tc^R, Fus^S and Gal^S
in E. coli delta4 cells (Gal+ in cells of E. coli strain
HB101). Appropriate controls are used to verify that the
loss of repression is due to a substitution in dbp1.

In Detailed Example 1 (using an engineered synthetic λ
cro gene designated rav), the mutant Rav protein with
specific binding to the Left Symmetrized Target (designated
RavL; gene, ravL) is used to produce a derivative defective
in dimerization. Studies of λ Cro suggest that the dimer
is stabilized by interactions in an antiparallel beta sheet
between residues E54, V55 and K56 from each monomer
(ANDE81, PABO84). In addition, F58 appears to stabilize
the Cro dimer through hydrophobic interactions between F58
of one monomer and residues in the hydrophobic core of the
other monomer (TAKE85). Further, mutational studies
(PAKU86) show that some substitutions at E54 and at F58
result in decreased intracellular specific protein levels
and that these mutant proteins lack repressor activity.
Mutants are constructed by using site-specific mutagenesis
to isolate VF55 and FW58 mutants of RavL. (Point mutations
are written as XYnnn, where X is the amino acid found at
location nnn in the parental protein and Y is the amino acid
found in the mutant.)
The cassettes bearing mutations that confer the VF55 and
FW58 substitutions are synthesized, and each is ligated
into the operative vector at the appropriate site within
the ravL gene. Selections and characterizations are as
described above. These alleles are designated ravL-55 and
ravL-58.

Alternative methods of obtaining dimerization-defective
DBP derivatives are not excluded. Thus the rav+ gene
(Detailed Example 1), or a potential-dbp+ gene coding for
any globular dimerizing protein, may be subjected to
Structure-directed Mutagenesis of residues involved in the
protein-protein dimer interface. In the case of the rav+
allele, residues 7, 23, 25, 30, 33, 40, 42, 52, 54, 55 and
58 are candidates for mutagenesis.
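
The XYnnn point-mutation notation used for substitutions such as VF55 and FW58 can be handled mechanically. The following is a minimal sketch rather than anything taken from the disclosure: the class and function names are hypothetical, and the four-residue sequence in the usage line is a toy placeholder, not the Rav or Cro sequence.

```python
# Sketch: parse "XYnnn" labels (X = parental residue, Y = mutant
# residue, nnn = 1-based position) and apply them to a sequence.
import re
from typing import NamedTuple


class PointMutation(NamedTuple):
    wild_type: str
    mutant: str
    position: int  # 1-based residue number


_LABEL = re.compile(r"^([A-Z])([A-Z])(\d+)$")


def parse_mutation(label: str) -> PointMutation:
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"not an XYnnn label: {label!r}")
    return PointMutation(match.group(1), match.group(2), int(match.group(3)))


def apply_mutation(sequence: str, mutation: PointMutation) -> str:
    index = mutation.position - 1
    # Guard against applying a label to the wrong parental sequence.
    if sequence[index] != mutation.wild_type:
        raise ValueError(
            f"expected {mutation.wild_type} at {mutation.position}, "
            f"found {sequence[index]}"
        )
    return sequence[:index] + mutation.mutant + sequence[index + 1:]


print(parse_mutation("VF55"))
print(apply_mutation("MKVF", parse_mutation("VF3")))  # toy sequence only
```

Checking the parental residue before substituting is a deliberate choice: it catches numbering mismatches between a label and the sequence it is applied to.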

For example, mutagenesis of rav+ residues 52, 54, 55
and 58, using a cassette carrying vgDNA at codons
specifying these residues, is followed by ligation and
transformation of cells. Selection is applied for plasmid
maintenance (Ap^R) and loss of repression (Tc^R, and galactose
utilization in HB101 cells). Variegated plasmid DNA is
purified from a population of Ap^R Tc^R Gal+ cells and
analyzed with restriction enzymes and Southern blotting.
Plasmid preparations containing the vg-rav fragment of
predominantly rav+ molecular weight are retained, and are
designated vg-ravA.

To isolate a second dimer-specific rav mutant protein,
designated RavB, such that the mutation in ravB is
complementary to a mutation contained in the vg-ravA population,
Structure-directed Mutagenesis is performed on a second
copy of the rav gene, designated ravB, carried on a plasmid
conferring a different antibiotic resistance (e.g. Km^R).
Residues affecting the same dimer interface are varied.
Competent vg-ravA cells are transformed with the vg-ravB
plasmid preparation. Transformants are obtained as Ap^R
Km^R, and further selected for Rav+ phenotype using the
selection systems (Tc^S, Fus^R, Gal^R in an E. coli delta4
cell genetic background).

Surviving colonies are analyzed by restriction
analysis of plasmids, and are backcrossed to obtain pure
plasmid lines that confer each of the Ap^R and Km^R
phenotypes. In this manner, mutants bearing obligate
complementing dimerization alleles of ravA and ravB are isolated.
These rav mutations may be tested pairwise to confirm
complementation, and are sequenced. The information
obtained from these mutants is used to introduce these
dimerization mutations into ravL and ravR genes previously
altered by Structure-directed Mutagenesis in DNA-binding
specificity domains as described above.



In one preferred embodiment of this invention
(strategy 3), isolation of dbpL and dbpR mutations that confer
specific and tight binding to target DNA sequences
TL←-TC-TL→ and TR←-TC-TR→ is followed by engineering of
second-site mutations causing a dimerization defect, for example
dbpL-1 as described herein. Complementing mutations are
introduced into each of the dbpR and dbpL genes, such that
obligate heterodimers are co-synthesized and folded
together in the same cell and bind specifically to the
non-palindromic targets.

A primary set of residues is identified. These
residues are predicted, on the basis of crystallographic,
modeling, and genetic information, to make contacts in the
dimer with the residue altered to produce DBPL-1. A
secondary set of residues is chosen, whose members are
believed to touch or influence the residues of the primary
set. An initial set of residues for Focused Mutagenesis in
the first variegation step is selected from residues in the
primary set. A variegation scheme, consistent with the
constraints described herein, is picked for these residues
so that the chemical properties of residues produced at
each variegated codon are similar to those of the wild-type
residue; e.g. hydrophobic residues go to hydrophobic or
neutral, charged residues go to charged or hydrophilic. A
cassette containing the vgDNA at the specified codons is
synthesized and ligated into the dbpR gene carried in a
vector with a different antibiotic selection than that on
the vector carrying the dbpL-1 gene. For example, in
Detailed Example 1, ravL-55 or ravL-58 are encoded on
plasmids that carry the gene for Ap^R. Variegated ravR
genes are cloned into plasmids bearing the gene for Km^R.

The protein produced by the dbpL-1 allele carrying
the dimerization mutation fails to bind to the dbpL target
TL←-TC-TL→. Cells bearing this target as a regulatory
site upstream of selection genes display the DBP-
phenotype. This phenotype is employed to select complementary
mutations in a dbpR gene. Following ligation of the
mutagenized cassette into the appropriate (e.g. Km^R)
plasmid, cells bearing dbpL-1 on a differently marked (e.g.
Ap^R) plasmid are made competent and transformed.
Transformants that have maintained the resident plasmid (e.g.
Ap^R Km^R) are further selected for DBP+ phenotype in which
binding of the non-palindromic target subsequence
TL←-TC-TR→ is required for repression. Only heterodimers
consisting of a 1:1 complex of the dbpL-1 gene product and the
complementing dbpR-1 gene product bind to the target, and
produce Tc^S Fus^R Gal^R colonies in the appropriate cell
host.

Each resident plasmid is obtained from candidate
colonies by plasmid preparation and transformation at low
plasmid concentration. Strains carrying plasmids encoding
either mutant dbpL-1 or mutant dbpR-1 genes are selected
by the appropriate antibiotic resistance (e.g. in Detailed
Example 1, Ap^R or Km^R selection, respectively). Plasmids
are independently screened for the DBP- phenotype,
characterized by restriction digestion and agarose gel
electrophoresis, and plasmid pairs are co-tested for
complementation by restoration of the DBP+ phenotype with respect to
the TL←-TC-TR→ target when both dbp alleles are present
intracellularly. Successfully complementing pairs of dbp
genes are sequenced. Subsequent variegation steps may be
required to optimize dimer interactions or DNA binding by
the heterodimer.

In Detailed Example 1, the VF55 change in RavL-55
introduces a bulky hydrophobic side group in place of a
smaller hydrophobic residue. A complementary mutation
inserts a very small side chain, such as G55 or A55, in a
second copy of the protein. In this case, the primary set
for mutagenesis is V55. A secondary set of residues
includes nearby components of the beta strand E54, K56,
P57, and E53 as well as other residues. In the initial
variegation step, residues 53-57 are subjected to Focused
Mutagenesis, such that all amino acids are tested at each
location. Cells containing complementing mutant proteins
are selected by requiring repression of the non-palindromic
HIV 353-369 target subsequence TL←-TC-TR→.

In another embodiment, the ravL-58 allele, carrying
the substitution FW58 and conferring the Rav- phenotype, is
used for selection of complementing mutations following
Structure-directed Mutagenesis of the ravR gene. Residues
L7, L23, V25, A33, I40, L42, A52, and G54 are identified
as the primary set. Residues in the secondary set
include F58, P57, P59, and buried residues in alpha helix
1. In the initial variegation step residues 23, 25, 33,
40, and 42 are varied through all twenty amino acids.
Subsequent iterations, if needed, include other residues of
the primary or secondary set. In this manner, ravR-1, -2,
-3, etc. are isolated, each of which yields a protein that
is an obligate complement of the ravL-58 mutation.
Selection for Rav+ phenotype using the HIV 353-369 target
TL←-TC-TR→ sequence is used as described in the preferred
embodiment.
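
Varying five residues through all twenty amino acids defines a large set of protein variants. The short sketch below counts them. It treats positions as independent and counts protein sequences only, not the number of DNA sequences that encode them, which depends on the codon scheme chosen; the function name is hypothetical.

```python
# Sketch: number of distinct protein sequences when each variegated
# position may hold any of the allowed amino acids independently.
def protein_variants(n_positions: int, n_amino_acids: int = 20) -> int:
    return n_amino_acids ** n_positions


# Five positions (e.g. residues 23, 25, 33, 40 and 42) varied fully.
print(protein_variants(5))
```

Counts of this kind can be compared against the number of independent transformants obtainable in one variegation step when deciding how many residues to vary at once.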



In either the preferred or alternative embodiments,
this process teaches a method of constructing obligate
complementing mutations at an oligomer interface. These
pairs of mutations may be used in further embodiments to
engineer novel DBPs specific for HIV 681-697 and HIV
760-776 targets; for targets in other pathogenic retroviruses
such as HTLV-II; for other viral DNA-containing pathogens
such as HSVI and HSVII; as well as for non-viral targets
such as deleterious human genes. Similar methodology is
claimed for engineering DBPs for use in animal and plant
systems.

Overview: Selection of the Initial DNA-Binding Protein
for Variegation

The choice of an initial DBP is determined by the
degree of specificity required in the intended use of the
successful DBP and by the availability of known DBPs. The
present invention describes three broad alternatives for
producing DBPs having high specificity and tight binding to
target DNA sequences. The present invention is not limited
to these classes of initial potential DBPs.

A first alternative is to use a polypeptide that will
conform to the DNA and can wind around the DNA and contact
the edges of the base pairs. A second alternative is to
use a globular protein (such as a dimeric H-T-H protein)
that can contact one face of DNA in one or more places to
achieve the desired affinity and specificity. A third
alternative is to use a series of flexibly linked small
globular domains that can make contact with several
successive patches on the DNA.

DNA features influencing choice of an initial DBP:

Features of DNA that influence the choice of an
initial DBP include sequence-specific DNA structure and the
size of the genome within which the DBP is expected to
recognize and affect gene expression.

Sequence-specific aspects of DNA structure that can
influence protein binding include: a) the edges of the
bases exposed in the major groove, b) the edges of the
bases exposed in the minor groove, c) the equilibrium
positions of the phosphate and deoxyribose groups, d) the
flexibility of the DNA toward deformation, and e) the
ability of the DNA to accept intercalated molecules. Note
that the sequence-specific aspects of DNA are carried
mostly inside a highly charged molecular framework that is
nearly independent of sequence.

The strongest signals of sequence are found in the
edges of the base pairs in the major groove, followed by
the edges in the minor groove. The groove dimensions
depend on local DNA sequence (NEID87b, KOUD87, UL7^87).

The number of base pairs required to define a unique
site depends on the size and non-randomness of the genome.
Consider a genome of length Zg bases and consider a
specific subsequence of length Q. If the genome is random,
the subsequence is expected to occur N(Q) times, where

$$ N(Q) = \frac{2 Z_g}{4^{Q}} = \frac{2 Z_g}{2^{2Q}} $$

From this equation, we derive the expression Qu, which is
the lower limit of the length of subsequences that are
expected to occur once or be absent:

$$ Q_u = \frac{\log_2(2 Z_g)}{2} $$



Zg        log2(2 Zg)/2    Qu
10^6      10.5            11
10^7      12.1            13
10^8      13.8            14
10^9      15.5            16
10^10     17.1            18

Thus, a DNA subsequence comprising 12 base pairs may be
unique in the E. coli genome (5 x 10^6 bp), but is likely to
occur about 180 times in a random sequence the size of the
human genome (3 x 10^9 bp).
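
The random-genome estimate can be checked numerically. The sketch below implements the expression for N(Q) as written, 2 Zg / 4^Q, and takes the tabulated Qu to be log2(2 Zg)/2 rounded up to a whole number, which is an assumption based on the table. The function names are hypothetical. Note that with the factor of 2 (both strands) the 12 bp estimate for a 3 x 10^9 bp genome comes out near 360; the figure of about 180 quoted in the text corresponds to counting a single strand, Zg / 4^Q.

```python
# Sketch: expected occurrences of a Q-base subsequence in a random
# genome of Zg bases (both strands), and the shortest length expected
# to occur at most once.
import math


def expected_occurrences(genome_bases: float, q: int) -> float:
    """N(Q) = 2 * Zg / 4**Q for a random genome."""
    return 2 * genome_bases / 4 ** q


def unique_length(genome_bases: float) -> int:
    """Smallest whole Q with N(Q) <= 1 (log2(2 Zg)/2, rounded up)."""
    return math.ceil(math.log2(2 * genome_bases) / 2)


for exponent in range(6, 11):
    zg = 10 ** exponent
    print(f"Zg = 10^{exponent}: log2(2Zg)/2 = "
          f"{math.log2(2 * zg) / 2:.2f}, Qu = {unique_length(zg)}")

print(expected_occurrences(5e6, 12))  # E. coli-sized genome, 12 bp site
print(expected_occurrences(3e9, 12))  # human-sized genome, both strands
```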

The non-random nature of DNA sequences in genomes has
been shown to result in the over- and under-representation
of specific sequences. The random-genome model can
underestimate the probe length needed to define a unique coding
sequence (LATH85). Recognition sites for certain
restriction enzymes occur in clusters and are found much more
often than expected (SMIT87). In contrast, lac repressor
binding sites in eukaryotic genomes are almost two orders
of magnitude less frequent than expected on the basis of
random sequence (SIMO84).

Protein features influencing choice of initial DBP:

Sequence-specific binding to DNA by DBPs does not
require unpairing of the bases. Most sequence-specific
binding by proteins to DNA is thought to involve contacts
in the DNA major groove.

To be certain of unique recognition in the human
genome, it is best to design a protein that recognizes 19
to 21 base pairs. To contact 20 base pairs directly, a
protein would need to: a) wind two full turns around the
DNA making major groove contacts, b) make a combination of
major groove and minor groove contacts, or c) contact the
major groove at four or five places. An extended
polypeptide, binding in the major groove of B-DNA, lies about 5.0
Å from the DNA axis. One base pair and 1 1/2 amino acids
extend roughly equal distances along the helix (SAEN83,
p238).

A nine residue alpha helix, such as the recognition
helices of H-T-H repressors, extends about 13.5 Å along the
major groove. If residues with long side chains are
located at each terminus of the helix, the helix can make
contacts over a 20.0 Å stretch of the major groove allowing
six base pairs to be contacted. Parts of the DBP other
than the second helix of the H-T-H motif can make
additional protein-DNA contacts, adding to specificity and
affinity. The rigidity of the alpha helix prevents a long
helix from following the major groove around the DNA. A
series of small domains, appropriately linked, could wind
around DNA, as has been suggested for the zinc-finger
proteins (BERG88a, GIBS88, FRAN88). In an extended
configuration a polypeptide chain progresses roughly 3.2 to
3.5 Å between consecutive residues. Thus, a 10 residue
extended protein structure could contact 5 to 8 bases of
DNA.
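
These length estimates reduce to simple arithmetic, sketched below with hypothetical function names. Two values are taken from the text: 3.2 to 3.5 Å per residue for an extended chain, and about 1 1/2 amino acids per base pair for an extended polypeptide in the major groove. The 1.5 Å rise per residue for an alpha helix is a standard textbook value supplied here as an assumption; it reproduces the 13.5 Å quoted for a nine residue helix.

```python
# Sketch: rough lengths of polypeptide segments and the number of base
# pairs an extended segment can span. Constants are approximate.
ALPHA_HELIX_RISE = 1.5          # Angstrom per residue (assumed standard value)
EXTENDED_RISE = (3.2, 3.5)      # Angstrom per residue, range from the text
RESIDUES_PER_BASE_PAIR = 1.5    # extended chain in the major groove, from the text


def helix_length(n_residues: int) -> float:
    return n_residues * ALPHA_HELIX_RISE


def extended_length(n_residues: int) -> tuple[float, float]:
    low, high = EXTENDED_RISE
    return n_residues * low, n_residues * high


def base_pairs_spanned(n_residues: int) -> float:
    return n_residues / RESIDUES_PER_BASE_PAIR


print(helix_length(9))           # nine residue recognition helix
print(extended_length(10))       # ten residue extended segment
print(base_pairs_spanned(10))    # falls within the 5 to 8 base range
print(base_pairs_spanned(30))    # residues needed to span ~20 bp
```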



Stable complexes of proteins with other macromolecules
involve burial of 1000 Å^2 to 3000 Å^2 of surface area on
each molecule. For a globular protein to make a stable
complex with DNA, the protein must have substantial surface
that is already complementary to the DNA surface or can be
deformed to fit the surface without loss of much free
energy. Considering these modalities we assign each
genetically encoded polypeptide to one of three classes:

1) a polypeptide that can easily deform to complement the
shape of DNA,

2) a globular protein, the internal structure of which
supports recognition elements to create a surface
complementary to a particular DNA subsequence, and

3) a sequential chain of globular domains, each domain
being more or less rigid and complementary to a
portion of the surface of a DNA subsequence and the
domains being linked by amino acid subsequences that
allow the domains to wind around the DNA.

Complementary charges can accelerate association of
molecules, but they usually do not provide much of the free
energy of binding. Major components of binding energy arise
from highly complementary surfaces and the liberation of
ordered water on the macromolecular surfaces.

Properties of sequence-specific DNA-binding by
polypeptides:

An extended polypeptide of 24 amino acids lying in
the major groove of B-DNA could make sequence-specific
interactions with as many as 15 base pairs, which is about
the least recognition that would be useful in eukaryotic
systems. Peptides longer than 24 amino acids can contact
more base pairs and thus provide greater specificity.

Extended polypeptide segments of proteins bind to DNA
in natural systems (e.g. λ repressor and Cro, P22 Arc and
Mnt repressors). The DNA major groove can accommodate
polypeptides in either helical or extended conformation.
Side groups of polypeptides that lie in the major groove
can make sequence-specific or sequence-independent
contacts. Since the polypeptide can lie entirely within the
major groove, contacts with the phosphates are allowed but
not mandatory. Thus a polypeptide need not be highly
positively charged. A neutral or slightly positively
charged polypeptide might have very low non-specific
binding.

Polypeptides composed of the 20 standard amino acids
are not flat enough to lie in the minor groove unless the
sequence contains an extraordinary number of glycines;
however, residue side-groups could extend into the minor
groove to make sequence-specific contacts. Polypeptides of
more than 50 amino acids may fold into stable 3D
structures. Unless part of the surface of the structure is
complementary to the surface of the target DNA subsequence,
formation of the 3D structure competes with DNA binding.
Thus polypeptides generated for selection of specific
binding are preferably 25 to 50 amino acids in length.

Polypeptides present the following potential
advantages:

a) low molecular weight: an extended polypeptide offers
the maximum recognition per amino acid,

b) polypeptides have no inherent dyad symmetry and so are
not biased toward recognition of palindromic
sequences,

c) polypeptides may have greater specificity than
globular proteins, and

d) peptides may be good models from which other low
molecular weight compounds may be designed.

Thus, one would choose a polypeptide as initial DNA-
binding molecule if high specificity and low molecular
weight are desired.

No sequence-specific DNA-binding by small
polypeptides has been reported to date. Possible reasons that
such polypeptides have not been found include: a) no one
has sought them, b) cells degrade polypeptides that are
free in the cytoplasm, and c) they are too flexible and are
not specific enough.

In a preferred embodiment, a DNA-binding polypeptide
is associated with a custodial domain to protect it from
degradation, as discussed more fully in Examples 3 and 4.

Properties of globular proteins influencing choice of
initial DBP:

The majority of the well-characterized DBPs are small
globular proteins containing one or more DNA-binding
domains. No single-domain globular protein comprising 200
or fewer amino acids is likely to fold into a stable
structure that follows either groove of DNA continuously
for 10 bases. The structure of a small globular protein
can be arranged to hold more than one set of recognition
elements in appropriate positions to contact several sites
along the DNA, thereby achieving high specificity; however,
the bases contacted are not necessarily sequential on the
DNA. For example, each monomer of λ repressor contains two
sequence-specific DNA recognition regions: the recognition
helix of the H-T-H region contacts the front face of the
DNA binding site and the N-terminal arm contacts the back
face. To obtain tight binding, a globular protein must
contact not only the base-pair edges, but also the DNA
backbone, making sequence-independent contacts. These
sequence-independent contacts give rise to a certain
sequence-independent affinity of the protein for DNA. The
bases that intervene between segments that are directly
contacted influence the position and flexibility of the
contacted bases. If the DNA-protein complex involves
twisting or bending the DNA (e.g. 434 repressor-DNA
complex), non-contacted bases can influence binding through
their effects on the rigidity of the target DNA sequence.



The phage repressors Arc, Mnt, λ repressor and Cro are
proposed to bind to DNA at least partly via binding of
extended segments of polypeptide chain. The N-terminal arm
of λ repressor makes sequence-specific contacts with bases
in the major groove on the back side of the binding site.
The C-terminal "tail" of λ Cro is proposed to make
sequence-independent contacts in the minor groove of the
DNA. The structure of neither Arc nor Mnt has been
determined; however, the sequence specificity of the
N-terminal arm of Arc can be transferred to Mnt; viz. when
Arc residues 1-9 are fused to Mnt residues 7 through the
C-terminus, the fusion protein recognizes the arc operator
but not the mnt operator. Residues 2, 3, 4, 5, 8, and 10 of
Arc have been proposed to contact operator DNA and residue
6 of Mnt has been shown to be involved in sequence-specific
operator contacts.

Binding to non-palindromic sequences requires
alteration of dyad-symmetric proteins. Even non-palindromic DNA
has approximate dyad symmetry in the deoxyribophosphate
backbone; proteins that are heterodimers or pseudo-dimers
engineered from known globular DBPs are good candidates for
the mutation process described here to obtain globular
proteins that bind non-palindromic DNA. It has been
observed that the DNA restriction enzymes having
palindromic recognition are composed of dyad symmetric multimers
(MCCL86), while restriction enzymes and other DNA-modifying
enzymes (e.g. Xis of phage λ) having asymmetric recognition
are composed of a single polypeptide chain or an
asymmetric aggregate (RICH88). Such proteins may also provide
reasonable starting points to generate DBPs recognizing
non-palindromic sequences.

A globular protein can bind sequence-specifically to
DNA through one set of residues and activate transcription
from an adjacent gene through a different set of residues
(for example, λ or P22 repressors). The internal structure
of the protein establishes the appropriate geometric
relationship between these two sets of residues. Globular
proteins may also bind particular small molecules,
effectors, in such a way that the affinity of the protein for
its specific DNA recognition subsequence is a function of
the concentration of the particular small molecules (e.g.
CRP and [cAMP]). Conditional DNA-binding and gene
activation are most easily obtained by engineering changes into
known globular DBPs.

Some DBPs from bacteria and bacteriophage have been
shown to have sufficient specificity to operate in
mammalian cells.

An initial DBP may be chosen from natural globular
DBPs of any cell type. The natural DBP is preferably
small so that genetic engineering is facile. Preferably,
the 3D structure of the natural DBP is known; this can be
determined from X-ray diffraction, NMR, genetic and
biochemical studies. Preferably, the residues in the
natural DBP that contact DNA are known. Preferably the
residues that are involved in multimer contacts are known.
Preferably the natural operator of the natural DBP is
known. More preferably, mutants of the natural operator
are known and the effects of these mutants on binding by
natural DBP and mutant DBPs are known. Preferably,
mutations of the DBP are known and the effects on protein
folding, multimer formation, and in vivo half-life are
known. Most of the above data are available for λ Cro, λ
repressor and fragments of λ repressor, 434 repressor and
Cro proteins, E. coli CRP and trp repressor, P22 Arc, and
P22 Mnt.

Globular DBPs are the best understood DBPs. In many
cases, globular DBPs are capable of sufficient specificity
and affinity for the target DNA sequence. Thus globular
DBPs are the most preferred candidates for initial DBP.
Table 8 contains a list of some preferred globular DBPs for
use as initial DBPs.

λ repressor and phage 434 repressor have been
extensively studied (CHAD71, PTAS80, PABO79, JOHN79, SAUE79,
SAUE86, PABO82a,b, LEWI83, OHLE83, WEIS87a,b,c, REID88,
ANDE87, NELS86, ELIA85). Both proteins comprise an amino-
terminal DNA-binding domain having four homologous alpha
helices. Helices 2 and 3 form the H-T-H motif. DNA
contacts originate in helix 2, helix 3, and adjacent
regions, with helix 3 providing most of the contacts. The
N-terminal domains of λ repressor contact each other along
helix 5 (PABO82b) while in 434 repressor the interdomain
contacts are beyond helix 4, there being no helix 5
(ANDE87).

The operator DNA bends symmetrically in the 434
repressor-consensus operator co-crystal (ANDE87). The
center of the 14 base pair DNA helix is over-wound and
bends slightly along its axis such that it curls around the
alpha 3 helix of each repressor monomer; the ends of the
operator DNA helix are underwound. Bending of operator DNA
has also been proposed in models of Cro protein and CAP
protein operator binding (OHLESS, GART88). Consistent with
the results of Gartenberg and Crothers, bending of the 434
operator toward Cro is toward the minor groove and occurs
most readily when the central bases consist exclusively of
A and T (KOUD87); in this case, substitution of CG base
pairs greatly reduces binding.

λ Cro (TAKE77) has been described from an X-ray
structure of the protein without DNA (ANDE81). Alpha helix
2 lies across the operator major groove and may make
contacts to operator backbone phosphates at its N-terminal
and C-terminal ends. In addition, backbone phosphates may
be contacted by residues at the C terminus of alpha 3, N
terminus of beta 2, and C terminus of beta 3 (PABO84). In
computer model building of λ Cro-operator DNA interactions,
bending of operator DNA or bending at the monomer-monomer
interface of the Cro dimer has been proposed to make the
best fit between operator and dimer (PABO84).

Key amino acids within the H-T-H region of 434 Cro and
λ Cro are highly conserved (PABO84), and 434 Cro binds
operator DNA as a dimer (WHAR85a). Because the crystals of
434 Cro and DNA do not diffract to high resolution, atomic
details of the protein-DNA interactions are not revealed
(WOLB88). Nevertheless, Wolberger et al. report very
significant similarities and differences between the DNA
binding patterns of 434 repressor and 434 Cro. These
observations on DBPs from 434, together with recent results
on Trp repressor (OTWI88), support the view that a)
structural elements that fit into the major groove of DNA
can function in a variety of closely related ways, b)
bending of DNA complexed to proteins is an important
determinant of specificity, and c) mechanisms of
recognition may be quite subtle.

Crystal structures have been determined for two DBPs,
CRP (WEBE87a) and TrpR (OTWI88) from E. coli. Both these
proteins contain H-T-H motifs and bind their cognate
operators only when particular effector molecules are bound
to the protein, cAMP for CRP and L-tryptophan for TrpR.
Binding of each effector molecule causes a conformational
change in the protein that brings the DNA-recognizing
elements into correct orientation for strong,
sequence-specific binding to DNA (JOHN86). The DNA-binding function
of Lac repressor is also modulated through protein binding
of an effector molecule (e.g., lactose); unlike CRP and
TrpR, Lac repressor binds DNA only in the absence of the
effector. CRP can act either as an activator (RENYSS) or
as a repressor (POLA88) depending on the relationship
between the CRP-binding site and the rest of the promoter.

Two structures of CRP (MCKA81, MCKA82) and one
structure of a CRP mutant (WEBE87a) are available.
Otwinowski et al. (OTWI88) have published an X-ray crystal
structure of TrpR bound to the Trp operator. This structure
shows that, although TrpR contains a canonical H-T-H
motif, the positioning of the recognition helix with
respect to the DNA is quite different from the positioning
of the corresponding helix in other H-T-H DBPs (MATT88) for
which structures of protein-DNA complexes are available.
Unlike previously determined structures, most of the
interactions between atoms of TrpR and bases are mediated
by localized water molecules. It is not possible to
distinguish between localized water and atomic ions, such
as Na+, by X-ray diffraction alone. We shall follow
Otwinowski et al. and refer to these peaks in electron
density as water, although ions cannot be ruled out.

Bass et al. (BASS88) studied the binding of wild type
TrpR and single amino acid missense mutants of TrpR to a
consensus palindromic Trp operator and to palindromic
operators that differ from the consensus by a symmetric
substitution at one base in each half operator. Bass et
al. conclude that the contact between the H-T-H motif of
TrpR and the operators must be substantially different from
the model that had been built based on the 434 Cro-DNA
structure.

Thus the binding of globular DBPs that are modulated
by effector molecules is fundamentally the same as the
binding of unmodulated globular DBPs, but the details of
each protein's interactions with DNA are quite different.
Prediction of which amino acids will produce strong
specific binding is beyond the capabilities of current
theory. Given the important role of localized waters or
ions in the TrpR-DNA interface (OTWI88) and in the 434R-DNA
interface (AGGA88), such predictions are likely to remain
beyond reach for some time.

The Mnt repressor of P22 is an 82 residue protein
that binds as a tetramer to an approximately palindromic 17
base pair operator, presumably in a manner that is two-fold
rotationally symmetric. Although the Mnt protein is 40%
alpha helical and has some homology to λ Cro protein, Mnt
is known to contact operator DNA by N-terminal residues
(VERS87a) and possibly by a residue (K79) close to the C
terminus (KNIG88). It is unlikely, therefore, that an
H-T-H structure in Mnt mediates DNA binding (VERS87a).
Another residue (Y78) close to the C-terminal end has been
found to stabilize tetramer formation (KNIG88). Though the
three dimensional structure of Mnt is not known,
DNA-binding experiments have indicated that the Mnt operator,
in B-form conformation, is contacted at major groove
nucleotides on both front and back sides of the operator
helix (VERS87a).

The Arc repressor of P22 is a 53 residue protein that
binds as a dimer to a partially palindromic 21 base pair
operator adjacent to the mnt operator in P22 and protects a
region of the operator that is only partially symmetric
relative to the symmetric sequences in the operator
(VERS87b). Arc is 40% homologous to the N-terminal portion
of Mnt, and the N-terminal residues of the Arc protein
contact operator DNA such that an H-T-H binding motif is
unlikely, as in Mnt binding (VERS86b). The three
dimensional structure of Arc, like that of Mnt, is not known,
but a crystallographic study is in progress (JORD85).
DNA-binding experiments have shown that Arc probably binds
along one face of B-form operator DNA. These experiments
indicate that Arc contacts operator phosphates farther out
from the center of operator symmetry than do the repressors
or Cro proteins of λ or 434, or P22 Mnt protein. Thus the
researchers state that the operator DNA may be bent around
Arc in binding or the Arc dimer may have an extended structure
to allow such contacts to occur (VERS87b). These
alternatives are not mutually exclusive.

DNA-Binding Proteins Other Than Repressor Proteins

Any protein (or polypeptide) which binds DNA may be
used as an initial DNA-binding protein; the present method
is not limited to repressor proteins, but rather includes
other regulatory proteins as well as DNA-binding enzymes
such as polymerases and nucleases.

Derivatives of restriction enzymes may be used as
initial DBPs. All known restriction enzymes recognize
eight or fewer base pairs and cut genomic DNA at many
places. Expression of a functional restriction enzyme at
high levels is lethal unless the corresponding
sequence-specific DNA-modifying enzyme is also expressed. EcoRI
that lacks residues 1-29, denoted EcoRI-delN29, has no
nuclease activity (JENJ86); EcoRI-delN29 binds
sequence-specifically to DNA that includes the EcoRI recognition
sequence, GAATTC (BECK88).
From the structure of R.EcoRI (MCCL86), we can see
that extension of the polypeptide chain at either the amino
or carboxy terminus would allow contacts with base pairs
outside of the canonical hexanucleotide.

Specifically, extending EcoRI(AT139), EcoRI(GS140), or
EcoRI(RQ203) (YANO87) by, for example, ten highly
variegated residues at the amino terminus and selecting for
binding to a target such as TGAATTCA or GGAATTCC allows
isolation of a protein having novel DNA-recognition
properties. Alternatively, EcoRI may be extended at the
amino terminus by addition of a zinc-finger domain. It may
be useful to have two or more tandem repeats of the
octanucleotide target placed in or near the promoter region
of the selectable gene. Fox (FOXK88) has used DNase-I to
footprint EcoRI bound to DNA and reports that 15 bp are
protected. Thus, repeated octanucleotide targets for
proteins derived from EcoRI should be separated by eight or
more base pairs; one could place one copy of the target
upstream of the -35 region and one copy downstream of the
-10 region. There are many residues in EcoRI that contact
the DNA as the enzyme wraps around it. These residues
could be varied to alter the binding of the protein. To
obtain acceptable specificity, we may need to pick as
initial DBP a mutant of EcoRI that folds and dimerizes, but
that binds DNA weakly. The mutations in regions of the
protein that contact DNA outside of the original GAATTC
will confer the desired affinity and specificity on the
novel protein.

One may wish to obtain a protein that binds to one
target DNA sequence, but not to other sequences that
contain a subsequence of the target. For example, we may
seek a protein that recognizes TGAATTCA, but not any of the
sequences vGAATTCb. To achieve this distinction, we place
the target sequence in the promoter region of the
selectable gene and one or more instances of the related
sequences, to which we intend that the protein not bind, in
the promoter region of an essential gene, such as an
antibiotic-resistance gene.
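
The related sequences can be listed explicitly. The following
is a minimal sketch, assuming that the lower-case letters v and
b are the standard IUPAC ambiguity codes (v = A, C, or G; b = C,
G, or T); that reading is an assumption, although it is
consistent with the exclusion of the target TGAATTCA itself.
The function name and table are hypothetical.

```python
from itertools import product

# IUPAC nucleotide codes needed here (assumed meaning of v and b).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "V": "ACG",  # any base except T
    "B": "CGT",  # any base except A
}


def expand(pattern):
    """Return every concrete DNA sequence matched by an IUPAC pattern."""
    pools = [IUPAC[ch.upper()] for ch in pattern]
    return ["".join(bases) for bases in product(*pools)]


TARGET = "TGAATTCA"
decoys = expand("vGAATTCb")  # sequences the protein should NOT bind

for seq in decoys:
    print(seq)
```

Under this reading the pattern expands to nine octanucleotides,
each containing GAATTC but differing from the target at both
flanking positions.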

Other stable proteins may also be used as initial
DBPs, even if they show no DNA-binding properties.
Parraga et al. (Reference 8 in PARR88) report that Eisen et
al. have fused 229 residues of yeast ADR1 to
beta-galactosidase and that the fusion protein binds
sequence-specifically to DNA in vitro.



Adenovirus E1A protein turns on early viral genes as
well as the human heat shock protein hsp70 (SIMO88).
Further, a normal inducible nuclear DNA-binding protein
regulates the interleukin-2 receptor alpha (IL-2R alpha)
gene and also promotes activation of transcription from the
HIV-1 virus LTR (BOHN88). These studies indicate one of
the many difficulties of designing antiviral chemotherapy
by using the transcriptional regulatory apparatus of the
virus as a target. This invention uses unique target
sequences, not represented elsewhere in the host genome, as
targets for suppression of gene expression.

The DNA sequences of operators that interact with
proteins that control mating-type and cell-type specific
transcription in yeast (MILL85) reveal that the consensus
site for action of the alpha2 protein dimer is symmetric,
while a heterodimeric complex of alpha2 and a1 subunits
acts on an asymmetric site. The alpha2-a1-responsive site
consists of a half-site that is identical to the alpha2
half-site, and another half-site that is a consensus for a1
protein binding. The spacings between the symmetric and
asymmetric sites are not the same.

Antibodies that bind DNA and other nucleic acids have
been obtained from human patients suffering from Systemic
Lupus Erythematosus. Murine monoclonal antibodies have
been obtained that specifically recognize Z-DNA, B-DNA,
ssDNA, triplex DNA, and certain repeating sequences
(ANDE88). Anderson et al. (ANDE88) report that: 1) the
antibodies studied contact six base pairs and four
phosphates, 2) antibodies are unlikely to provide some of the
well known motifs for DNA-binding, e.g. helix-turn-helix,
3) study of DNA-antibody complexes may yield insights into
mechanisms of recognition, and 4) a DNA-recognizing
antibody might be converted into a sequence or structure
specific nuclease. The shortness of the contact makes it
unlikely that high specificity can be attained.

Properties of serially-linked globular domains: 

A protein motif for DNA binding, present in some
eukaryotic transcription factors, is the zinc finger in
which zinc coordinately binds cysteine and histidine
residues to form a conserved structure that is able to
bind DNA (FRAN88). Xenopus laevis transcription factor
TFIIIA is the first protein demonstrated to use this motif
for DNA binding, but other proteins such as human
transcription factor SP1, yeast transcription activation factor
GAL4, and estrogen receptor protein have been shown to
require zinc for DNA binding in vitro (EVAN88). Other
mammalian and avian steroid hormone receptors and the
adenovirus E1A protein, which bind DNA at specific sites, contain
cysteine-rich regions which may form metal chelating loops.

Zinc-finger regions have been observed in the sequences
of a number of eukaryotic DBPs, but no high-resolution
3D structure of a Zn-finger protein is yet available. A
variety of models have been proposed for the binding of
zinc-finger proteins to DNA (FAIR86, PARR88, BERG88,
GIBS88). Model building suggests which residues in the
Zn-fingers contact the DNA and these would provide the primary
set of residues for variation. Berg (BERG88) and Gibson et
al. (GIBS88) have presented models having many similarities
but also some significant differences. Both models suggest
that the motif comprises an antiparallel beta structure
followed by an alpha helix and that the front side of the
helix contacts the major groove of the DNA. By assuming
that conserved basic residues of the Zn-finger make contact
with phosphate groups in each copy of the motif, Gibson et
al. deduce that the amino terminal part of the helix makes
direct contact to the DNA. The Gibson model does not,
however, account well for the number of bases contacted by
Zn-finger proteins. The observations on H-T-H proteins
suggest that a DNA-recognizing element can interact in a
variety of ways with DNA and we assert that a similar
situation is likely in Zn-finger proteins. Thus, until a
3D model of a Zn-finger protein bound to DNA is available,
all of the residues modeled as occurring on the alpha helix
away from the beta structure should be considered as
primary candidates for variegation when one wishes to alter
the DNA-binding properties of a Zn-finger protein. In
addition, residues in the beta segment may control
interactions with the sugar-phosphate backbone which can affect
both specific and non-specific binding.
Parraga et al. (PARR88) have reported a low-resolution
structure of a single zinc-finger from NMR data. They
confirm the alpha helix proposed by Berg and by Gibson et
al., but not the antiparallel beta sheet. The models
proposed by Klug and colleagues (FAIR86) have a common
feature that is at variance with the models of Berg and of
Gibson et al., viz. that the protein chain exits each
finger domain at the same end that it entered. The
structure published by Parraga et al. does not settle this
point, but suggests that the exit strand tends toward the
end opposite from the entrance strand, thereby supporting
the overall models of Berg and of Gibson et al. Parraga et
al. also report that a) a chimeric molecule consisting of
zinc-finger domains linked to beta-galactosidase binds
sequence-specifically to DNA and b) a protein comprising
only two finger motifs can bind sequence-specifically to
DNA. They do not suggest that the residues could be
mutagenized to achieve novel recognition.
A protein composed of a series of zinc fingers offers
the greatest potential of uniquely recognizing a single
site in a large genome. A series of zinc fingers is not so
well suited to development of a DBP that is sensitive to an
effector molecule as is a more compact globular protein
such as E. coli CRP. Positive control of genes adjacent to
the target DNA subsequence can be achieved as in the case
of TFIIIA.

Overview: Variegation Strategy

Choice of residues in parental potential-DBP to vary:

We choose residues in the initial potential-DBP to
vary through consideration of several factors, including:
a) the 3D structure of the initial DBP, b) sequences
homologous to the initial DBP, c) modeling of the initial
DBP and mutants of the initial DBP, d) models of the 3D
structure of the target DNA, and e) models of the complex
of the initial DBP with DNA. Residues may be varied for
several reasons, including: a) to establish novel
recognition by changing the residues involved directly in DNA
contacts while keeping the protein structure approximately
constant, b) to adjust the positions of the residues that
contact DNA by altering the protein structure while keeping
the DNA-contacting residues constant, c) to produce
heterodimeric DBPs by altering residues in the dimerization
interface while keeping DNA-contacting residues constant,
and d) to produce pseudo-dimeric DBPs (see below) by
varying the residues that join segments of dimeric DBPs
while keeping the DNA-contacting residues and other
residues fixed.

If a dimeric protein comprises two identical
polypeptide chains related by a two-fold axis of rotation, we
speak of a homodimer with two-fold dyad symmetry. When two
very similar polypeptides fold into similar domains and
associate, we may observe that there is an approximate
two-fold rotational axis that relates homologous residues, such
as the alpha1-beta1 dimer of haemoglobin. We refer to such
a protein as a heterodimer and to the symmetry axis as a
quasi-dyad. When we produce a single-chain DBP by fusing
gene fragments that encode two DNA-binding domains joined
by a linker amino acid subsequence, we call the molecule a
pseudo-dimer and the axis that relates pairs of residues a
pseudo-dyad.

Principles that guide choice of residues to vary:



A key concept is that only structured proteins
exhibit specific binding, i.e. can bind to a particular
chemical entity to the exclusion of most others. In the
case of polypeptides, the structure may require
stabilization in a complex with DNA. The residues to be varied are
chosen to preserve the underlying initial DBP structure or
to enhance the likelihood of favorable polypeptide-DNA
interactions. The selection process eliminates cells
carrying genes with mutations that prevent the DBP from
folding. Genes that code for proteins or polypeptides that
bind indiscriminately are eliminated since cells carrying
such proteins are not viable. Although preservation of the
basic underlying initial DBP structure is intended, small
changes in the geometry of the structure can be tolerated.
For example, the spatial relationship between the alpha 3
helix in one monomer of λ Cro and the alpha 3 helix in the
dyad-related monomer (denoted alpha 3') is a candidate for
variation. Small changes in the dimerization interface
can lead to changes of up to several Å in the relative
positions of residues in alpha 3 and alpha 3'.



Burial of hydrophobic surfaces so that bulk water is
excluded is one of the strongest forces driving the
folding of macromolecules and the binding of proteins to
other molecules. Bulk water can be excluded from the
region between two molecules or between two portions of a
single molecule only if the surfaces are complementary.
The double helix of B-DNA allows most of the hydrophobic
surface nucleotides to be buried. The edges of the bases
have several hydrogen-bonding groups; the methyl group of
thymine is an important hydrophobic group in DNA (HARR88).
To achieve tight binding, the shape of the protein must be
highly complementary to the DNA, all or almost all
hydrogen-bonding groups on both the DNA and the protein must
make hydrogen bonds, and charged groups must contact
either groups of opposite charge or groups of suitable
polarity or polarizability.

There are two complementary interfaces of major
interest: a) the DNA-protein interface and b) the interface
between protein monomers of dimers or between domains of
pseudo-dimers. The DNA-protein interface is more polar
than most protein-protein interfaces, but hydrophobic amino
acids (e.g., F, L, M, V, I, W, Y) occur in sequence-specific
DNA-protein interfaces. The protein-protein interfaces of
natural DBPs are typical protein-protein interfaces.

Amino acids are classified as hydrophilic or
hydrophobic (ROSE85, EISE86a,b), and although this
classification is helpful in analyzing primary protein structures, it
ignores that the side groups may contain both hydrophobic
and hydrophilic portions, e.g., lysine. Hydrogen bonds and
other ionic interactions have strong directional behavior,
while hydrophobic interactions are not directional. Thus
substitution of one hydrophobic side group for another
hydrophobic side group of similar size in an interface is
frequently tolerated and causes subtle changes in the
interface. For the purposes of the present invention, such
hydrophobic-interchange substitutions are made in the
protein-protein interface of DBPs so that a) the geometry
of the two monomers in the dimer will change, and b)
compensating interactions produce exclusively heterodimers.

The process claimed here tests as many surfaces as
possible to select one as efficiently as possible that
binds to the target. The selection isolates cells
producing those proteins that are more nearly complementary to
the target DNA, or proteins in which intermolecular or
intramolecular interfaces are more nearly complementary to
each other so that the protein can fold into a structure
that can bind DNA. The effective diversity of a variegated
population is measured by the number of different surfaces,
rather than the number of protein sequences. Thus we
should maximize the number of surfaces generated in our
population, rather than the number of protein sequences.
Proteins do not have distinct, countable surfaces;
therefore, we define an interaction set as a collection of
residues of a protein that can simultaneously touch the
target DNA.

If N spatially separated residues of a protein are
varied, 20 x N surfaces are generated. Variation of N
residues in the same interaction set yields 20^N surfaces.
For example, if N = 6, variation of spatially separated
residues yields 120 surfaces while variation of interacting
residues yields 20^6 = 6.4 x 10^7 surfaces. The process of
varying residues in an interaction set to maximize the
number of surfaces obtained is referred to as
Structure-directed Mutagenesis.
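
The arithmetic can be checked with a minimal sketch. The
function names are hypothetical, and the sketch assumes, as the
text does, that each varied residue may take any of the 20
amino acids.

```python
AMINO_ACIDS = 20  # choices available at each varied position


def surfaces_separated(n):
    """Surfaces from n spatially separated residues: each residue
    contributes its own 20 variants independently (20 x N)."""
    return AMINO_ACIDS * n


def surfaces_interacting(n):
    """Surfaces from n residues in one interaction set: every
    combination of residues is a distinct surface (20^N)."""
    return AMINO_ACIDS ** n


n = 6
print(surfaces_separated(n))    # 120
print(surfaces_interacting(n))  # 64000000, i.e. 6.4 x 10^7
```

The gap between the two counts grows exponentially with N, which
is why residues within a single interaction set are preferred.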


If the protein residues to be varied are close enough
together in sequence that the variegated DNA (vgDNA)
encoding all of them can be made in one piece, then
cassette mutagenesis is picked. The present invention is
not limited to a particular length of vgDNA that can be
synthesized. With current technology, a stretch of 60
amino acids (180 DNA bases) can be spanned.

Mutation of residues further than sixty residues
apart can be achieved using other methods, such as
single-stranded-oligonucleotide-directed mutagenesis (BOTS85) and
two or more mutating primers.
To vary residues separated by more than sixty
residues, two cassettes may be mutated serially. From 2-fold
to 1000-fold variegation is first introduced into a first
cassette. We then introduce 1000-fold to 10^-fold
variegation into a second cassette of the variegated vector
population. The composite level of variation preferably
does not exceed the prevailing capabilities to a) produce
very large numbers of independently transformed cells or b)
select small components in a highly varied population.
The limits on the level of variegation are discussed
below.
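
The choice between one cassette and serial cassettes can be
sketched as follows. This is an illustration only: the function
names are hypothetical, the greedy grouping is an assumed
procedure rather than one stated in the text, and the only
figure taken from the text is the 60-residue span. The
composite level of variation is taken to be the product of the
per-cassette levels.

```python
from math import prod

MAX_SPAN = 60  # residues one cassette can span, per the text


def group_into_cassettes(positions, max_span=MAX_SPAN):
    """Greedily group residue positions so that each group fits
    within one cassette of at most max_span residues."""
    cassettes = []
    for pos in sorted(set(positions)):
        # Span is counted inclusively from the first residue in the group.
        if cassettes and pos - cassettes[-1][0] + 1 <= max_span:
            cassettes[-1].append(pos)
        else:
            cassettes.append([pos])
    return cassettes


def composite_variegation(levels):
    """Composite level of variation for serially mutated cassettes."""
    return prod(levels)


print(group_into_cassettes([26, 27, 30, 35]))  # one cassette suffices
print(group_into_cassettes([5, 20, 90, 100]))  # two cassettes needed
print(composite_variegation([1000, 1000]))     # hypothetical levels
```

The composite figure is the one to compare against the number of
independent transformants that can be produced.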

Assembly of Relevant Data: 

Here we assemble the data about the initial DBP and
the target that are useful in deciding which residues to
vary in the variegation cycle:

1) 3D structure, or at least a list of residues that
contact DNA and that are involved in the dimer contact
of the initial DBP,

2) list of sequences homologous to the initial DBP, and

3) model of the target DNA sequence.

These data and an understanding of the function and
structure of different amino acids in proteins will be
used to answer three questions:

1) which residues of the initial DBP are on the outside
and close enough together in space to touch the target
DNA simultaneously?

2) which residues of the initial DBP can be varied with
high probability of retaining the underlying initial
DBP structure?

3) which residues of the initial DBP can affect the
dimerization or folding of the initial DBP?

Although an atomic model of the target material is
preferred in such examination, it is not necessary.

Graphical and computational tools:

The most appropriate method of picking the residues of
the protein chain at which the amino acids should be varied
is by viewing with interactive computer graphics a model of
the initial DBP complexed with operator DNA. A model based
on X-ray data from the DNA-protein complex is preferred,
but other models may be used. A stick-figure
representation of molecules is preferred. Suitable programs for
viewing and manipulating protein and nucleic acid models
include: a) PS-FRODO, written by T. A. Jones (JONE85) and
distributed by the Biochemistry Department of Rice
University, Houston, TX; and b) PROTEUS, developed by Dayringer,
Tramontano, and Fletterick (DAYR86). Any hardware that
supports either of these programs is appropriate.

Use of Knowledge of Mutations Affecting Protein
Stability

In choosing the residues to vary and the
substitutions to be made for such residues, one may make use not
only of modelling as described above but also of
experimental data concerning the effects of mutation in the
initial DNA-binding protein. Mutations which will markedly
reduce protein stability are to be avoided in most cases.

Missense mutations that decrease DNA-binding protein
function non-specifically by affecting protein folding are
distinguished from binding-specific mutations primarily on
the basis of protein stability (NELS83, PAKU86, VERS86b,
HECH84, HECH85a, and HECH85b).
Tables 1, 12, and 13 summarize the results of a
number of studies on single missense mutations in the
three bacteriophage repressor proteins: λ repressor
(Table 12) (NELS83, GUAR82, HECHSSa, and NELS85), λ Cro
(Table 1) (PAKU86, EISE85), and P22 Arc repressor (Table
13) (VERS86a, VERS86b). The majority of the mutant
sequences shown in Tables 1, 12, and 13 were obtained in
experiments designed to detect loss of function in vivo.
The second-site pseudo-reversion mutations (HECH85a) and
suppressed nonsense mutations (NELS83) restore function,
and some of the site specific changes (EISE85) produce
functional proteins.

Roughly 50-70% of the single missense mutations of
the DNA-binding proteins selected for loss of function
(Tables 1, 12, and 13) produce protein folding defects.

Use of Knowledge of Mutations Affecting the DNA-Protein
Interface

Missense mutations in residues thought to be involved
in specific interactions with DNA have been reported for
several prokaryotic repressor proteins. Table 14 shows an
alignment of the H-T-H DNA-binding domains of four
prokaryotic repressor proteins (from top to bottom: λ repressor,
λ Cro, 434 repressor and trp repressor) and indicates the
positions of missense mutations in residues that are
solvent-exposed in the free protein but become buried in
the protein-DNA complex, and that affect DNA binding.

Randomly obtained missense mutations in
solvent-exposed residues of λ repressor, λ Cro, and trp repressor
yield sets of mutants that reduce DNA binding (Table 14).
These sets correlate well to the sets of residues that are
proposed to interact directly with DNA. Some mutations in
λ Cro (EISE85) and all those shown for 434 repressor
(WHAR85a) were obtained through site-directed mutagenesis.
Most of the mutations shown in the λ and trp repressor
sequences are trans-dominant when the mutant gene is
present on an overproducing plasmid (NELS83, KELL85). The
exceptions to trans-dominance are the λ repressor SP35 and
the trp repressor AT80 mutations. This latter change
produces a repressor that has only slightly reduced binding
(KELL85). The trans-dominance observed for these mutations
is proposed by the authors to result from the wild-type
repressor and the mutant repressor forming mixed oligomers
which are inactive in binding to operator sites.

Wharton (WHAR85a) has reported that extensive
site-directed mutagenesis of 434 repressor positions 28 and 29
produced no functional protein sequences other than the
wild-type. Apparently, in the context of 434 repressor
structure and operators, only proteins with the wild-type
Q28-Q29 sequence bind to the wild-type operators.

Table 14 also shows missense mutations that result in
near normal repressor activity. Substitution of 434
repressor Q33 with H, L, V, T, or A produces repressors
that function if expressed from overproducing plasmids
(WHAR85a); repressor specificity is, however, reduced.
Mutations in λ repressor, QY33 (NELS83, HECH83), and in λ
Cro, YF26 (EISE85), produce altered proteins which make
one less H-bond to the DNA and which bind to the operator
DNA with reduced affinity. Thus, loss of a single H-bond
is insufficient to completely abolish binding of DNA.
Mutations YK26 and HR35 in λ Cro show nearly normal binding
(EISE85).

Nelson and Sauer (NELS85) and Hecht et al. (HECH85a,
b) have described four replacements in λ repressor (Table
12): EK34, GN48, GS48, and EK83. These derivatives have
higher affinity for O_R1 than wild-type λ repressor.



Extended amino acid arms at N- and C-terminal
locations are important DNA-binding structures in at least four
prokaryotic repressors: λ repressor and Cro, and P22 Arc
and Mnt.



Sequence-specific and sequence-independent contacts
are made by the first 6 amino acid residues (STKKKP) of the
λ repressor N-terminal region, which form an "arm" that can
wrap around the DNA (ELIA85, PABO82a). Missense mutations
KE4 and LP12 (Table 12) both greatly reduce repressor
activity in vivo (NELSBS). Deletion of the first six
residues results in a protein which is non-functional in
vivo (ELIA85). Deletion of the first three residues
results in decrease of affinity for O_R1, loss of protection
of back side guanines, altered specificity between O_R1 and
O_R3, and decreased binding sensitivity to changes in
temperature or salt concentration (ELIA85, PABO82a).



Missense mutations of P22 Arc that produce non-functional
proteins with high intracellular specific protein
levels (Table 13) are found only in the N-terminal 10
residues of the protein (VERS86b). A single residue change
at position 6 (HP6) in P22 Mnt changes operator recognition
in the altered protein (YOUD83, VERS86a,b). Knight and
Sauer (cited in VERS86a,b) replaced the first 6 residues of
Mnt repressor with the first 9 residues of Arc repressor to
produce a repressor that binds to the arc operator but not
to the mnt operator. Thus P22 Mnt and Arc use a recognition
region located in the first 6-10 amino-terminal
residues for DNA recognition and binding. The N-terminal
DNA-binding region of these proteins cannot be the
recognition helix of a typical H-T-H motif.



In λ Cro, a C-terminal sequence (K62-K63-T64-T65-A66)
has been suggested on the basis of model building (TAKE85)
and NMR measurements (LEIG87) to form a flexible arm that
interacts with minor groove phosphates. Eisenbeis and
Caruthers (cited in KNIG88) have found that T64, T65, and
A66 have minor effects on protein-operator affinity, while
K63 is very important. The C-terminal sequence of P22 Mnt
(K79-K80-T81-T82) is almost identical to that of λ Cro.

It has been shown (KNIG88) that deletion of the three
residues after K79 has little effect on protein structure
or DNA binding. Deletion of K79 and the distal residues,
however, reduces operator binding by three orders of
magnitude with little apparent change in protein structure.

Use of Knowledge of Mutations Affecting the Protein-
Protein Interface

It is also possible to modulate DNA-binding specificity
by altering the protein-protein interface. Because
the oligomerization equilibrium is coupled to DNA binding,
mutations that alter oligomerization affect operator site
affinity. Since oligomerization involves the matching of
protein surfaces, many interactions are hydrophobic and
mutations which specifically destabilize oligomerization
are similar to mutations which destabilize global protein
structure. Interactions at the site of oligomerization can
influence the strength of interactions at the DNA-binding
site by subtle alterations in protein structure.

Use of Mutations That Affect Activation

When λ, 434, and P22 repressors bind to their respective
O_R2 sites, they activate transcription (POTE80,
POTE82, PTAS80). The site on λ repressor which activates
RNA polymerase is located on the N-terminal domain of the
molecule (BUSH88, HOCH83, SAUE79). Activation requires
contact between the N-terminal domain of repressor at O_R2
and RNA polymerase (HOCH83, SAUE79), and this contact
stimulates isomerization of the polymerase complex to the
open form (McClure and Hawley, cited in GUAR82).

Missense mutations in λ, P22, or 434 repressors that
specifically reduce P_RM activation while leaving operator
binding intact are in the solvent-exposed protein surface
closest to RNA polymerase bound at P_RM (GUAR82, PABO79,
BUSH88, WHAR85a). For λ and 434 repressor this surface
includes residues in alpha helix 2 and in the turn between
alpha helices 2 and 3. In P22 repressor, the surface is
formed at the carboxyl terminus of alpha helix 3 (PABO79,
TAKE83). In each repressor, the changes that reduce
transcriptional activation at P_RM involve the substitution
of a basic residue for a neutral or acidic residue. Further,
missense mutations in λ and 434 repressors which increase
transcription at P_RM involve the substitution of an acidic
residue for a neutral or basic residue (GUAR82, BUSH88).

Transcriptional activation at P_RM involves the
apposition of a negatively charged surface on the N-terminal
domain of λ, 434, or P22 repressor to a site on
RNA polymerase (BUSH88). Mutations that a) alter the
negatively charged surface of repressor by removing acidic
residues or by replacing them with basic residues, or b)
position the negative surface incorrectly with respect
to RNA polymerase, decrease transcriptional activation at
P_RM. Alterations that produce a more negatively charged
surface act to increase transcription at P_RM.

Pick principal set of residues to vary: 

A huge number of variant DNA sequences can be generated
by synthesis with mixed reagents at chosen bases.
Usually, it is necessary that the number of variants not
exceed the number of independently transformed cells
generated from the synthetic DNA. It is efficient,
however, to make the number of variants as close as
practical to this limit. The total number of variants is
the product of the number of variants at each varied codon
over all the variable codons. Thus, we first consider
which residues could be varied with an expectation that
alteration could affect DNA binding. We then pick a range
of amino acids at each variable residue. The total number
of variants is the product of these numbers. If the
product is too large or too small, we alter the list of
residues and range of variation at each variable residue
until an acceptable number is found.
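
The bookkeeping described in the preceding paragraph can be sketched in a few lines of Python. This is only an illustration: the function names, the example plan of five fully varied residues, and the assumed limit of 10^7 independent transformants are hypothetical choices made for the example and are not taken from the disclosure.

```python
from math import prod


def library_size(variants_per_codon):
    """Total number of variants: the product, over all varied codons,
    of the number of variants allowed at each codon."""
    return prod(variants_per_codon)


def fits_limit(variants_per_codon, transformants):
    """True if the number of variants does not exceed the number of
    independently transformed cells (hypothetical acceptance test)."""
    return library_size(variants_per_codon) <= transformants


# Hypothetical plan: five residues, each varied through all 20 amino acids.
plan = [20, 20, 20, 20, 20]
limit = 10**7  # assumed number of independent transformants

print(library_size(plan))              # 3200000
print(fits_limit(plan, limit))         # True
print(fits_limit(plan + [20], limit))  # False: a sixth residue overshoots
```

In practice one would shrink or enlarge the per-residue ranges until `fits_limit` holds while the product stays close to the limit.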

Considering which residues are on the surface of the
initial DBP, we pick residues that are close enough
together on the surface of the initial DBP to touch a
molecule of the target simultaneously without having any
initial DBP main-chain atom come closer than van der Waals
distance (viz. 4.0 to 5.0 Å center to center) to any target
atom. For the purposes of the present invention, a residue
of the initial DBP "touches" the target if:

a) a main-chain atom is within van der Waals distance,
   viz. 4.0 to 5.0 Å, of any atom of the target molecule,
b) the C_beta is within a specific distance of any atom of
   the target molecule so that a side-group atom could
   make contact with that atom, or
c) there is evidence that altering the residue alters the
   DNA-binding of the initial DBP.

The residues in the principal set need not be contiguous
in the protein sequence. The exposed surfaces of the
residues to be varied need not be connected. We prefer
only that the amino acids in the residues to be varied all
be capable of touching a single copy of the target DNA
sequence simultaneously without atoms overlapping.

In addition to the geometrical criteria, we prefer
that there be indications that the initial DBP structure
will tolerate substitutions at each residue in the principal
set of residues. Indications could come from various
sources, including homologous sequences and modeling.

Pick a secondary set of residues to vary:

The secondary set comprises those residues not in the
primary set that touch residues in the primary set. These
residues might be excluded from the primary set because the
residue is: a) internal, b) highly conserved, or c) on the
surface, but the curvature of the initial DBP surface
prevents the residue from being in contact with the target
at the same time as one or more residues in the primary
set.

Internal residues are frequently conserved and the
amino acid type cannot be changed to a significantly
different type without risk that the protein structure will
be disrupted. Nevertheless, some conservative changes of
internal residues, such as I to L or F to Y, are tolerated.
Such conservative changes affect the detailed placement and
dynamics of adjacent protein residues and such variation
may be useful to improve the characteristics of DBP
binding.

Surface residues in the secondary set are most often
located on the periphery of the principal set. Such
peripheral residues cannot make direct contact with the
target simultaneously with all the other residues of the
principal set. It is appropriate to vary the charge of
some or all of these residues. For example, the variegated
codon containing equimolar A and G at base 1, equimolar C
and A at base 2, and A at base 3 yields amino acids T, A,
K, and E with equal probability.
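
The charge-varying codon in the example above can be checked by enumerating the mixed codon against the standard genetic code. The sketch below is illustrative only; `CODON_TABLE` and `amino_acid_distribution` are names introduced here, not part of the disclosure.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def amino_acid_distribution(base1, base2, base3):
    """Each argument maps nucleotide -> fraction at that codon position.
    Returns amino acid -> probability for the variegated codon."""
    dist = Counter()
    for (a, pa), (b, pb), (c, pc) in product(
            base1.items(), base2.items(), base3.items()):
        dist[CODON_TABLE[a + b + c]] += pa * pb * pc
    return dict(dist)


# Equimolar A/G at base 1, equimolar C/A at base 2, A at base 3.
dist = amino_acid_distribution({"A": 0.5, "G": 0.5},
                               {"C": 0.5, "A": 0.5},
                               {"A": 1.0})
print(dist)  # T, K, A and E, each with probability 0.25
```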

Choice of residues to vary simultaneously:

The allowed level of variegation determines how many
residues can be varied at once; geometry determines which
ones. The user may pick residues to vary in many ways;
the following is a preferred manner. The user picks the
objective of the variegation, vide supra.

The number of residues picked is coupled to the range
through which each can be varied. In the first round
progressivity is not an issue; the user may elect to
produce a level of variegation such that each molecule of
vgDNA is potentially different through, for example,
unlimited variegation of 10 codons (20^10, approximately
10^13, different protein sequences). The levels of efficiency of
ligation and transformation reduce the number of DNA
5 sequences actually tested to between 10"^ and 10^. Multiple 
performances of the process with very high levels of 
variegation will not yield repeatable results; the user 
decides whether this is important. 

Pick range of variation:

Each varied residue can have a different scheme of
variegation, producing 2 to 20 different possibilities. We
require that the process be progressive, i.e. each variegation
cycle produces a better starting point for the next
variegation cycle than the previous cycle produced.

N.B.: Setting the level of variegation such that
      the parental pdbp and many sequences related to
      the parental pdbp sequence are present in
      detectable amounts ensures that the process is
      progressive. If the level of variegation is so
      high that the parental pdbp sequence cannot be
      detected as a transformant, then each round of
      mutagenesis is independent of previous rounds
      and there is no assurance of progressivity.
      This approach can lead to valuable DNA-binding
      proteins, but multiple repetitions of the
      process at this level of variegation will not
      yield progressive results. Excessive
      variegation is not preferred in subsequent
      iterations of this process.

Progressivity is not an all-or-nothing property. So
long as most of the information obtained from previous
variegation cycles is retained and many different surfaces
that are related to the parental DBP surface are produced,
the process is progressive. If the level of variegation is
so high that the parental dbp gene may not be detected, the
assurance of progressivity diminishes. If the probability
of recovering the parental DBP is negligible, then the
probability of progressive results is also negligible.

An opposing force in our design considerations is
that DBPs are useful in the population only up to the
amount that can be detected; any excess above the detectable
amount is wasted. Thus we produce as many surfaces
related to the parental DBP as possible within the constraint
that the parental DBP be present as a marker for
the detection level.

Mutagenesis of DNA:

We now decide how to distribute the variegation
within the codons for the residues to be varied. These
decisions are influenced by the nature of the genetic
code. When vgDNA is synthesized, variation at the first
base of a codon creates a population coding for amino acids
from the same column of the genetic code table (Table 16);
variation at the second base of the codon creates a
population coding for amino acids from the same row of the
genetic code table; variation at the third base of the
codon creates a population coding for amino acids from the
same box. Work with 3D protein structural models may
suggest definite sets of amino acids to substitute at a
given residue, but the method of variation may require
either more or fewer kinds of amino acids be included. For
example, substitution of N or Q at a given residue may be
wanted. Combinatorial variation of codons requires that
mixing N and Q at one location also include K and H as
possibilities at the same residue. The present invention
does not rely on accurate predictions of the amino acids to
be placed at each residue; rather, attention is focused on
which residues should be varied.
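
The N/Q example can be verified by brute force. The sketch below is an assumption-laden illustration, not the procedure of the disclosure: for every way of choosing one codon per wanted amino acid it forms the per-position base mixtures and lists everything the resulting mixed codon encodes, keeping the smallest result. The helper names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def codons_of(amino_acid):
    """All codons that encode the given amino acid."""
    return [c for c, aa in CODON_TABLE.items() if aa == amino_acid]


def encoded_by_mixture(codons):
    """Base sets needed at each position to make all the given codons,
    and every amino acid the resulting mixed codon encodes."""
    position_sets = [sorted({c[i] for c in codons}) for i in range(3)]
    encoded = {CODON_TABLE["".join(t)] for t in product(*position_sets)}
    return position_sets, encoded


def smallest_mixture(wanted):
    """Try one codon per wanted amino acid; keep the mixture that
    encodes the fewest distinct amino acids."""
    best = None
    for choice in product(*(codons_of(aa) for aa in wanted)):
        position_sets, encoded = encoded_by_mixture(choice)
        if best is None or len(encoded) < len(best[1]):
            best = (position_sets, encoded)
    return best


position_sets, encoded = smallest_mixture("NQ")
print(position_sets, sorted(encoded))
```

Whichever pair of codons is chosen for N and Q, the mixture also encodes K and H, which is the point made in the text.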



There are many ways to generate diversity in a
protein (RICH86, CARU85, OLIP86). An extreme case is that
one or a few residues of the protein are varied as much as
possible (inter alia see CARU85, CARU87, RICH86, WHAR85a).
We will call this limit "Focused Mutagenesis". When there
is no binding between the parental DBP and the target, we
preferably pick a set of five to seven residues on the
surface and vary each through all 20 possibilities.

An alternative plan of mutagenesis ("Diffuse Mutagenesis")
that may be useful is to vary many more residues
through a more limited set of choices (VERS86a,b, INOU86
(Ch. 15), PAKU86). This can be accomplished by spiking
each of the pure nucleotides activated for DNA synthesis
(e.g. nucleotide-phosphoramidites) with one or more of the
other activated nucleotides. Contrary to general practice,
the present invention sets the level of spiking so that
only a small percentage (1% to .00001%, for example) of
the final product will contain the parental DNA sequence.
This will ensure that the majority of molecules carry
single, double, triple, and higher mutations and, as
required for progressivity, that recovery of the parental
sequence will be a possible outcome.

Let N_b be the number of bases to be varied, and let Q
be the fraction of all DNA sequences that should have the
parental sequence; then M, the fraction of the nucleotide
mixture that is the majority component, is

   M = exp{ log_e(Q) / N_b } = 10^{ log_10(Q) / N_b }.

If, for example, thirty base pairs on the DNA chain were to
be varied and 1% of the product is to have the parental
sequence, then each mixed nucleotide substrate should
contain 86% of the parental nucleotide and 14% of other
nucleotides. Table 17 shows the fraction (fn) of DNA
molecules having n non-parental bases when 30 bases are
synthesized with reagents that contain fraction M of the
majority component. When M = .63096, f24 and higher are less
than 10~®. Note that substantial probability for 8 or more
substitutions occurs only if the fraction of parental 
sequence (fo) drops to around 10"^. 
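
The relation between Q, N_b, and M, and fractions of the kind tabulated as fn, can be reproduced with a short calculation. The sketch below assumes that each varied base is substituted independently, so that the number of non-parental bases follows a binomial distribution; that assumption and the function names are introduced here for illustration.

```python
from math import comb, exp, log


def majority_fraction(q, n_bases):
    """M such that M ** n_bases == q, i.e. M = exp(ln(Q) / N_b)."""
    return exp(log(q) / n_bases)


def fraction_with_n_changes(m, n_bases, n):
    """Fraction of molecules carrying exactly n non-parental bases,
    assuming independent substitution at each varied base."""
    return comb(n_bases, n) * (1 - m) ** n * m ** (n_bases - n)


# Thirty varied bases, 1% of the product to keep the parental sequence.
m = majority_fraction(0.01, 30)
print(round(m, 3))  # 0.858, i.e. about 86% parental nucleotide

# f0 recovers Q, and the fractions over all n sum to one.
f = [fraction_with_n_changes(m, 30, n) for n in range(31)]
print(f[0], sum(f))
```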

The N_b base pairs of the DNA chain that are synthesized
with mixed reagents need not be contiguous. They are
picked so that between N_b/3 and N_b codons are affected to
various degrees. The residues picked for mutation are
picked with reference to the 3D structure of the initial
DBP, if known. For example, one might pick all or most of
the residues in the principal and secondary set. We may
impose restrictions on the extent of variation at each of
these residues based on homologous sequences or other data.
The mixture of non-parental nucleotides need not be random;
rather, mixtures can be biased to give particular amino acid
types specific probabilities of appearance at each codon.
For example, one residue may contain a hydrophobic amino
acid in all known homologous sequences; in such a case, the
first and third base of that codon would be varied, but the
second would be set to T. This Diffuse Mutagenesis will
reveal the subtle changes possible in the protein backbone
associated with conservative interior changes, such as V to
I, as well as some not so subtle changes that require
concomitant changes at two or more residues of the protein.

Focused Mutagenesis: 

If we have no information indicating that a particular
amino acid or class of amino acid is appropriate, we
approximate substitution of all amino acids with equal
probability because representation of one or a few pdbp
genes above the detectable level is unproductive. Equal
amounts of all four nucleotides at each position in a codon
yields the amino acid distribution:
   4/64 A    2/64 C    2/64 D    2/64 E    2/64 F    4/64 G
   2/64 H    3/64 I    2/64 K    6/64 L    1/64 M    2/64 N
   4/64 P    2/64 Q    6/64 R    6/64 S    4/64 T    4/64 V
   1/64 W    2/64 Y    3/64 stop

This distribution has the disadvantage of giving two basic
residues for every acidic residue. Such predominance of
basic residues is likely to promote sequence-independent
DNA binding. In addition, six times as much R, S, and L as
W or M occur for the random distribution. Use of equimolar
C and G at the third base reduces the over-representation
of S, R, and L, but does not cure the maldistribution of
acidics and basics.

Consider the distribution of amino acids encoded by
one codon in a population of vgDNA. Let Abun(x) be the
abundance of DNA sequences coding for amino acid x. For
any distribution, there will be a most-favored amino acid
(mfaa) with abundance Abun(mfaa) and a least-favored amino
acid (lfaa) with abundance Abun(lfaa). We seek the
nucleotide distribution that allows all twenty amino acids
and that yields the largest ratio Abun(lfaa)/Abun(mfaa)
subject to two constraints. First, the abundances of
acidic and basic amino acids should be equal. Second, the
number of stop codons should be kept as low as possible.
Thus only nucleotide distributions that yield

   Abun(E) + Abun(D) = Abun(R) + Abun(K)

are considered, and the function maximized is:

   f(distribution) =
      { (1 - Abun(stop)) (Abun(lfaa)/Abun(mfaa)) }.

We limit the third base to equimolar T and G (C and G
would be equivalent). All amino acids are possible and the
number of accessible stop codons is reduced.
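
The codon counts quoted for the equimolar distribution, and the effect of restricting the third base, can be confirmed by counting codons in the standard genetic code. The following sketch is illustrative; `codon_counts` is a name introduced here.

```python
from collections import Counter

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def codon_counts(third_bases):
    """Number of codons per amino acid when bases 1 and 2 are fully
    mixed and base 3 is limited to `third_bases` (equimolar)."""
    return Counter(CODON_TABLE[a + b + c]
                   for a in BASES for b in BASES for c in third_bases)


all_four = codon_counts("TCAG")   # 64 codons
t_or_g = codon_counts("TG")       # 32 codons

print(all_four["R"] + all_four["K"], all_four["D"] + all_four["E"])  # 8 4
print(all_four["L"], all_four["W"], all_four["M"])                   # 6 1 1
print(all_four["*"], t_or_g["*"])                                    # 3 1
print(sum(1 for aa in t_or_g if aa != "*"))                          # 20
```

With T or G at the third base all twenty amino acids remain and only one stop codon is accessible, but basic codons still outnumber acidic codons two to one, which is why bases 1 and 2 are also adjusted.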

A computer program, "Find Optimum vgCodon" (Table
18), varies the composition at bases 1 and 2, in steps of
0.05, and reports the composition that gives the largest
value of f(distribution) subject to the constraints:

   g2 = (g1*a2 - 0.5*a1*a2)/(c1 + 0.5*a1),
   t1 = 1 - a1 - c1 - g1, and
   t2 = 1 - a2 - c2 - g2,

where a1 is the fraction of A at base 1, c2 the fraction of
C at base 2, and so on. The first constraint requires equal
amounts of acidic and basic amino acids and the second and
third conserve matter.

We vary a1, c1, g1, a2, and c2 and then calculate t1, g2,
and t2. Initially, variation is in steps of 5%. Once an
approximately optimum distribution of nucleotides is
determined, the region is further explored with steps of
1%. The optimum distribution is:



            Optimum vgCodon

               T       C       A       G
   base #1 =  0.26    0.18    0.26    0.30
   base #2 =  0.22    0.16    0.40    0.22
   base #3 =  0.5     0.0     0.0     0.5

and yields DNA molecules encoding each type of amino acid
with the abundances shown in Table 19.
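
The properties claimed for this distribution can be checked numerically. The sketch below is not the program of Table 18; it only evaluates the stated optimum, using helper names (`abundances`, `score`) that are introduced here. The nucleotide fractions are copied from the "Optimum vgCodon" table.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def abundances(base1, base2, base3):
    """Abundance of DNA encoding each amino acid (and stop, '*') for a
    codon whose three positions have the given nucleotide fractions."""
    ab = Counter()
    for (a, pa), (b, pb), (c, pc) in product(
            base1.items(), base2.items(), base3.items()):
        ab[CODON_TABLE[a + b + c]] += pa * pb * pc
    return ab


def score(ab):
    """f(distribution) = (1 - Abun(stop)) * Abun(lfaa) / Abun(mfaa)."""
    amino = [v for k, v in ab.items() if k != "*"]
    return (1 - ab["*"]) * min(amino) / max(amino)


optimum = abundances({"T": 0.26, "C": 0.18, "A": 0.26, "G": 0.30},
                     {"T": 0.22, "C": 0.16, "A": 0.40, "G": 0.22},
                     {"T": 0.5, "G": 0.5})

acidic = optimum["D"] + optimum["E"]
basic = optimum["R"] + optimum["K"]
print(round(acidic, 4), round(basic, 4))  # 0.12 0.1202
print(round(optimum["*"], 4))             # 0.052
print(round(score(optimum), 3))           # 0.386
```

The small difference between the acidic and basic totals arises because g2 is rounded to 0.22 in the table.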

The actual nucleotide distribution obtained in
synthetic DNA will differ from the specified nucleotide
distribution due to several causes, including: a) differential
inherent reactivity of nucleotide substrates, and b)
differential deterioration of reagents. It is possible to
compensate partially for these effects, but some residual
error will occur. We denote the average discrepancy
between specified and observed nucleotide fraction as S_err,

   S_err = square root( average[ ((f_obs - f_spec)/f_spec)^2 ] ),

where f_obs is the amount of one type of nucleotide found at
a base and f_spec is the amount of that type of nucleotide
that was specified at the same base. The average is over
all specified types of nucleotides and over a number (e.g.,
10 to 50) of different variegated bases. By hypothesis,
the actual nucleotide distribution at a variegated base
will be within 5% of the specified distribution. Actual
DNA synthesizers and DNA synthetic chemistry may have
different error levels. It is the user's responsibility to
determine S_err for the DNA synthesizer and chemistry
employed by the user.

To determine the possible effects of errors in
nucleotide composition on the amino acid distribution, we
modified the program "Find Optimum vgCodon" in four ways:

1) the fraction of each nucleotide in the first two bases
   is allowed to vary from its optimum value times (1 -
   S_err) to the optimum value times (1 + S_err) in seven
   equal steps (S_err is the hypothetical fractional error
   level), maintaining the sum of nucleotide fractions
   for one codon position at 1.0,

2) g2 is varied in the same manner as a2, i.e. we dropped
   the restriction that Abun(D) + Abun(E) = Abun(K) +
   Abun(R),

3) t3 and g3 are varied from 0.5 times (1 - S_err) to 0.5
   times (1 + S_err) in three equal steps,

4) the smallest ratio Abun(lfaa)/Abun(mfaa) is sought.

In actual experiments, we direct the synthesizer to
produce the optimum DNA distribution "Optimum vgCodon"
given above. Incomplete control over DNA chemistry may,
however, cause us to actually obtain the following
distribution, which is the worst that can be obtained if all
nucleotide fractions are within 5% of the amounts specified
in "Optimum vgCodon". A corresponding table can be
calculated for any given S_err using the program "Find worst
vgCodon within S_err of given distribution" given in Table
20.

       Optimum vgCodon, worst 5% errors

               T       C       A       G
   base #1 =  0.251   0.189   0.273   0.287
   base #2 =  0.209   0.160   0.400   0.231
   base #3 =  0.475   0.0     0.0     0.525

This distribution yields DNA encoding each of the twenty
amino acids at the abundances shown in Table 21.

Each codon synthesized with the distribution of bases
shown above displays 4 x 4 x 2 = 2^5 = 32 possible DNA
sequences, though not in equal abundances. An oligonucleotide
containing N such codons would display 2^(5N) possible
DNA sequences and would encode 20^N protein sequences.
Other variegation schemes produce different numbers of DNA
and protein sequences. For example, if two bases in one
codon are varied through two possibilities each, then there
are 2 x 2 = 4 DNA sequences and 2 x 2 = 4 protein
sequences.

If five codons are synthesized with reagents mixed so
as to produce the nucleotide distribution "Optimum
vgCodon", and if we actually obtained the nucleotide
distribution "Optimum vgCodon, worst 5% errors", then DNA
sequences encoding the mfaa at all of the five codons are
about 277 times as likely as DNA sequences encoding the
lfaa at all of the five codons. Further, about 24% of the
DNA sequences will have a stop codon in one or more of the
five codons.
Consider variegation of a hypothetical sequence,
F24-G25-D26-E27-T28, in which each variegated codon is
synthesized as an "Optimum vgCodon". The actual abundance of the
DNA encoding each type of amino acid is, however, taken
from the case of S_err = 5% given in Table 21. The abundance
of DNA encoding the parental amino acid sequence is:

   Amount (parental seq.)

         F24       G25       D26       E27       T28
     = Abun(F) x Abun(G) x Abun(D) x Abun(E) x Abun(T)
     = .0249   x .0663   x .0545   x .0602   x .0437
     = 2.4 x 10^-7

Therefore, if the efficiency of the entire process allows
us to examine 10^7 different DNA sequences, DNA encoding the
parental DBP sequence as well as very many related sequences
will be present in sufficient quantity to be detected
and we are assured that the process will be progressive.
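
The figures quoted for the worst-case distribution (the factor of about 277, the stop-codon fraction of about 24%, and the parental abundance of 2.4 x 10^-7) can be recomputed from the nucleotide fractions alone. The sketch below is a check written for this purpose, not the program of Table 20; all helper names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def abundances(base1, base2, base3):
    """Abundance of DNA encoding each amino acid (and stop, '*')."""
    ab = Counter()
    for (a, pa), (b, pb), (c, pc) in product(
            base1.items(), base2.items(), base3.items()):
        ab[CODON_TABLE[a + b + c]] += pa * pb * pc
    return ab


# "Optimum vgCodon, worst 5% errors" as tabulated in the text.
worst = abundances({"T": 0.251, "C": 0.189, "A": 0.273, "G": 0.287},
                   {"T": 0.209, "C": 0.160, "A": 0.400, "G": 0.231},
                   {"T": 0.475, "G": 0.525})

amino = {k: v for k, v in worst.items() if k != "*"}
mfaa = max(amino, key=amino.get)   # most-favored amino acid
lfaa = min(amino, key=amino.get)   # least-favored amino acid

n_codons = 5
ratio = (amino[mfaa] / amino[lfaa]) ** n_codons
any_stop = 1 - (1 - worst["*"]) ** n_codons
parental = prod(amino[aa] for aa in "FGDET")  # F24-G25-D26-E27-T28

print(mfaa, lfaa, round(ratio))   # R F 278 (the text quotes about 277)
print(round(any_stop, 2))         # 0.24
print(f"{parental:.1e}")          # 2.4e-07
```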

Setting level of variegation:

We use the following procedure to determine whether a
given level of variegation is practical:

1) from: a) the intended nucleotide distribution at each
   base of a variegated codon, and b) S_err (the error
   level in mixed DNA synthesis), calculate the abundances
   of DNA sequences coding for each amino acid and
   stop,

2) calculate the abundance of DNA encoding the parental
   DBP sequence by multiplying the abundances of the
   parental amino acid at each variegated residue.

The abundances used in the procedure above are calculated
from the worst distribution that is within S_err of the
specified distribution. A variegation that ensures that
the parental DBP sequence can be recovered is practical.
Such a level of variegation produces an enormous number of
multiple changes related to the parental DBP available for
selection of improved successful DBPs. We adjust the
subset of residues to be varied and levels of variegation
at each residue until the calculated variegation is within
bounds.

Reduction of gratuitous restriction sites:

If the method of mutagenesis to be used is replacement
of a cassette, we consider whether the variegation generates
gratuitous restriction sites. We reduce or eliminate
gratuitous restriction sites by appropriate choice of
variegation pattern and silent alteration of codons
neighboring the sites of variegation.
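
One way to look for gratuitous sites in a short variegated cassette is to enumerate every sequence the mixture can produce and search each for the recognition sequence. The sketch below is an illustration under stated assumptions: the cassette fragments are invented for the example, the EcoRI site GAATTC is used only because it is a familiar palindromic site, and only one strand is searched (sufficient for palindromic sites; a non-palindromic site would also require searching its reverse complement). Exhaustive enumeration is practical only for short cassettes.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
         "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}


def gratuitous_sites(template, site):
    """Return every variant of a degenerate template that contains the
    given recognition sequence."""
    hits = []
    for bases in product(*(IUPAC[ch] for ch in template)):
        seq = "".join(bases)
        if site in seq:
            hits.append(seq)
    return hits


ECORI = "GAATTC"

# Hypothetical fragment: M (A or C) can complete an EcoRI site.
print(gratuitous_sites("GGAMTTCC", ECORI))  # ['GGAATTCC']

# Changing a neighboring fixed base removes the site from all variants.
print(gratuitous_sites("GGCMTTCC", ECORI))  # []
```

In a real design the neighboring change would be chosen to be silent at the protein level, as the text describes.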

Focused mutagenesis:

In the preferred embodiment of this process, the
number of residues and the range of variation at each
residue are chosen to maximize the number of DNA binding
surfaces, to minimize gratuitous restriction sites, and to
assure the recovery of the initial DBP sequence. For
example, in Detailed Example 1, the initial DBP is λ Cro.
One primary set of residues includes G15, Q16, K21, Y26,
Q27, S28, N31, K32, H35, A36, and R38 of the H-T-H region
(Table 14b) and C-terminal residues K56, N61, K62, K63,
T64, T65, and A66. A secondary set of residues includes
L23, G24, and V25 from the turn portion of the H-T-H
region, buried residues T20, A21, A30, I31, A34, and I35
from alpha helices 2 and 3, and dimerization region
residues E54, V55, F58, P59, and S60.

The initial set of 5 residues for Focused Mutagenesis
contains residues in or near the N-terminal half of alpha
helix 3: Y26, Q27, S28, N31, and K32. Varying these 5
residues through all 20 amino acids produces 3.2 x 10^6
different protein sequences encoded by 32^5 (= 3.3 x 10^7)
different DNA sequences. Since all 5 residues are in the
same interaction set, this variegation scheme produces the
maximum number of different surfaces. Assuming optimized
nucleotide distribution described above and S_err = 5%, the
probability of obtaining the parental sequence is 3.2 x
10^-7. This level is within bounds for synthesis, ligation/
transformation, and selection capable of examining 10^7
sequences of vgDNA. Codons for the 5 residues picked for
Focused Mutagenesis are contained in the 51 bp PpuM II to
Bgl II fragment of the rav+ gene constructed in Detailed
Example 1.

Repetition to obtain desired degree of DNA-binding:

The first variegation step can produce one or more
DBPs having DNA-binding properties that are satisfactory to
the user. If the best selected DBP is not fully satisfactory,
parental DBPs for a second variegation step are
picked from DBPs isolated in the first variegation step.
The second and subsequent variegation steps may employ
either Focused or Diffuse Mutagenesis procedures on
residues of the primary or secondary sets. In the preferred
embodiment of this process, the user chooses residues
and mutagenesis procedures based on the structure of the
parental DBP and specific goals. For example, consider
three hypothetical cases.

In a first case, a variegation step produces a DBP
with greater non-specific DNA binding than is desired.
Information from sequence analysis and modeling is used to
identify residues involved in sequence-independent
interactions of the DBP with DNA in the non-specific complex.
In the next variegation step, some or all of these residues,
together with one or more additional residues from
the primary set, are chosen for Focused Mutagenesis and
additional residues from the primary or secondary sets are
chosen for Diffuse Mutagenesis.
In a second hypothetical case, a variegation step
produces a DBP with strong sequence-specific binding to
the target and the goal is to optimize binding. In this
case, the next variegation step employs Diffuse Mutagenesis
of a large number of residues chosen mostly from the
secondary set.

In the third hypothetical case, a DBP has been
isolated that has insufficient binding properties. A set
of residues is chosen to include some primary residues that
have not been subjected to variation, one or more primary
residues that have been varied previously, and one or more
secondary residues. Focused Mutagenesis is performed on
this set in the next variegation step.

Overview: DNA Synthesis, Purification, and Cloning

DNA sequence design:

The present invention is not limited to a single
method of gene design. The idbp gene need not be synthesized
in toto; parts of the gene may be obtained from
nature. One may use any genetic engineering method to
produce the correct gene fusion, so long as one can easily
and accurately direct mutations to specific sites. In all
of the methods of mutagenesis considered in the present
invention, however, it is necessary that the DNA sequence
for the idbp gene be unique compared to other DNA in the
operative cloning vector. If the method of mutagenesis is
to be replacement of subsequences coding for the potential-
DBP with vgDNA, then the subsequences to be mutagenized
must be bounded by restriction sites that are unique with
respect to the rest of the vector. If single-stranded
oligonucleotide-directed mutagenesis is to be used, then
the DNA sequence of the subsequence coding for the initial
DBP must be unique with respect to the rest of the vector.

The coding portions of genes to be synthesized are
designed at the protein level and then encoded in DNA.
The amino acid sequences are chosen to achieve various
goals, including: a) expression of initial DBP
intracellularly, and b) generation of a population of potential-
DBPs from which to select a successful DBP. The ambiguity
in the genetic code is exploited to allow optimal placement
of restriction sites and to create various distributions of
amino acids at variegated codons.

Organization of gene synthesis:

The present invention is not limited as to how a
designed DNA sequence is divided for easy synthesis. An
established method is to synthesize both strands of the
entire gene in overlapping segments of 20 to 50 nucleotides
(THER88). An alternative method that is more suitable for
synthesis of vgDNA is similar to methods published by
others (OLIP86, OLIP87, AUSU87, KARN84). Contrary to most
previous workers, we: a) use two synthetic strands, and b)
do not cut the extended DNA in the middle. Our goals are:
a) to produce longer pieces of dsDNA than can be syn-
thesized as ssDNA on commercial DNA synthesizers, and b)
to produce strands complementary to single-stranded vgDNA.
By using two synthetic strands, we remove the requirement
for a palindromic sequence at the 3' end. Moreover, the
overlap should not be palindromic lest single DNA molecules
prime themselves.
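
The condition that the overlap not be palindromic can be
tested before the strands are ordered. The sketch below is an
illustration only; it assumes "palindromic" means that the
overlap equals its own reverse complement, and the function
names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq.upper()))


def is_self_complementary(overlap):
    """True if the overlap equals its own reverse complement.

    Such an overlap would let two copies of the same strand anneal
    to each other, so it should be avoided in the design.
    """
    s = overlap.upper()
    return s == reverse_complement(s)
```

An overlap for which this function returns True should be
redesigned, for example by shifting the overlap boundaries by
a few bases.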

The present invention is not limited to any particular
method of DNA synthesis or construction. Preferably, DNA
is synthesized on a Milligen 7500 DNA synthesizer (Mil-
ligen, Bedford, MA) by standard procedures. Synthetic DNA
is purified by polyacrylamide gel electrophoresis (PAGE) or
high-pressure liquid chromatography (HPLC). The present
invention is not limited to any particular method of
purifying DNA for genetic engineering.

IDBP gene cloning:



We clone the idbp gene using plasmids that are trans-
formed into competent bacterial cells by standard methods
(MANI82) or slightly modified standard methods. DNA
fragments derived from nature are operably linked to other
fragments of DNA.

Cells transformed with the plasmid bearing the
complete idbp gene are tested to verify expression of the
initial DBP. Selection for plasmid presence is maintained
on all media, while selections for DBP+ phenotypes are
applied only after growth in the presence of inducer
appropriate to the promoter. Colonies that display the
DBP+ phenotypes in the presence of inducer and DBP-
phenotypes in the absence of inducer are retained for
further genetic and biochemical characterization. The
presence of the idbp gene is initially detected by restric-
tion enzyme digestion patterns characteristic of that gene
and is confirmed by sequencing.

The dependence of the IDBP+ and IDBP- phenotypes on
the presence of this gene is demonstrated by additional
genetic constructions. These are a) excision of the idbp
gene by restriction digestion and closure by ligation, and
b) ligation of the excised idbp gene into a plasmid
recipient carrying different markers and no dbp gene.
Plasmids obtained by excising the gene confer the DBP-
phenotypes (e.g. TcR, FusS, and GalS in Detailed Example
1). Plasmids obtained from ligation of idbp to a recipient
plasmid confer the DBP+ phenotypes in the presence of an
inducer appropriate to the regulatable promoter (e.g. TcS,
FusR, and GalR in Detailed Example 1). Finally, a most
important demonstration of the successful construction
involves determination of the quantitative dependence of
the selected phenotypes on the exogenous inducer concentra-
tion.

Overview: DNA-binding Protein Purification and
Characterization

Isolation of IDBP:

We purify IDBP and its derivatives by standard
methods, such as those described in JOHN80, TAKE86, JJEXGQ7,
VERS85b, and KADO86.

Quantitation and characterization of protein-DNA binding:

Methods that can be used to quantitate and char-
acterize sequence-specific and sequence-independent binding
of a DBP to DNA include: a) filter-binding assays, b)
electrophoretic mobility shift analysis, and c) DNase
protection experiments. Ionic strength, pH, and tempera-
ture are important factors influencing DBP binding to DNA.
Standard conditions should correspond closely to the
anticipated conditions of use. Thus, if a binding protein
is intended for use in bacterial cells in standard culture,
a reasonable range of values from which to choose standard
conditions would be: pH 7.5 to 8.0, 0.1 to 0.2 M KCl, and
32 to 37°C. Assay buffers preferably include cofactors,
stabilizing agents, and counter ions for proper DBP
function.

We prepare DNA fragments for analysis of protein-DNA
binding by methods that are very similar to those described
in MAXA77, KLEN70, RIGB77, and KIMJ87. Filter-binding
assays can yield thermodynamic (KD) and kinetic (ka and
kd) constants and are performed by methods similar to those
described by RIGG70 and KIMJ87. Electrophoretic mobility
shift measurements can also yield values of KD, ka, and kd
and are performed by methods similar to those of FRIE81.
DNase protection assays use the methods of JOHN79, MAXA77,
and FOXK88. We use chemical methods to characterize binding
of proteins to DNA similar to the methods described in
BRUN87, BUSH85, and JENJ86.

Table of Examples

Ex. 1   Protocol for developing a new DNA-binding protein
        with affinity for a DNA sequence found in HIV-1,
        by variegation of λ Cro.

Ex. 2   Protocol for developing a new DNA-binding
        polypeptide with affinity for a DNA sequence
        found in HIV-1, by variegation of a polypeptide
        having a segment homologous with Phage P22 Arc.

Ex. 3   Use of a custodial domain (residues 20-83 of
        barley chymotrypsin inhibitor) to protect a DNA-
        binding polypeptide from degradation.

Ex. 4   Use of a custodial domain containing a DNA-
        recognizing element (alpha-3 helix of Cro) to
        protect a DNA-binding polypeptide from degrada-
        tion.

Ex. 5   Protocol for addition of an arm to Phage P22 Arc
        to alter its DNA-binding characteristics.

Ex. 6   Protocol for preparation of a novel DNA-binding
        protein that recognizes an asymmetric DNA
        sequence and corresponds to a fusion of the third
        zinc-finger domain of the Drosophila kr gene
        product and the DNA-binding domain of Phage P22
        Arc.

DETAILED EXAMPLE 1

Below is a hypothetical example of a protocol for
developing a new DNA-binding protein derived from λ Cro
with affinity for a DNA sequence found in human immunodefi-
ciency virus type 1 (HIV-1) using E. coli K-12 as the cell
line or strain. Further optimization, in accordance with
the teachings herein, may be necessary to obtain the
desired results. Possible modifications in the preferred
method are discussed following various steps of the
example.

By hypothesis, we set the following technical capabil-
ities:

     Yield from DNA synthesis
          500 ng/synthesis of ssDNA 100 bases long,
          10 ug/synthesis of ssDNA 60 bases long,
          1 mg/synthesis of ssDNA 20 bases long.

     Maximum oligonucleotide
          100 bases

     Yield of plasmid DNA
          1 mg/l of culture medium

     Efficiency of DNA ligation
          0.1 % for blunt-blunt,
          4 % for sticky-blunt,
          11 % for sticky-sticky.

     Yield of transformants
          5 x 10^ / ug DNA

     Error in mixed DNA synthesis (Serr)
          5%



Choice of cell line or strain:

In this example, the following E. coli K-12 recA
strains are used: ATCC #35,882 delta4 (Genotype: W3110
trpC, recA, rpsL, sup°, delta4 (gal-chlD-pgl-att λ))
and ATCC #33,694 HB101 (Genotype: F-, leuB, proA, recA,
thi, ara, lacY, galK, xyl, mtl, rpsL, supE, hsdS, (rB-,
mB-)). E. coli K-12 strains are grown at 37°C in LB broth
(MANI82, p440) and on LB agar (addition of 15 g Bacto-agar)
for routine purposes. Selections for plasmid uptake and
maintenance are performed with addition of ampicillin (Ap)
(200 ug/ml), tetracycline (Tc) (12.5 ug/ml) and kanamycin
(Km) (50 ug/ml).

Choice of initial DBP: 

The initial DBP is λ Cro. Helix-turn-helix proteins
are preferred over other known DBPs because more detail is
known about the interactions of these proteins with DNA
than is known for other classes of natural DBP. λ Cro is
preferred over λ repressor because it has lower molecular
weight. Cro from 434 is smaller than λ Cro, but more is
known about the genetics and 3D structure of λ Cro. An X-
ray structure of the λ Cro protein has been published, but
no X-ray structure of a DNA-Cro complex has appeared. A
mutant of λ Cro, Cro67, confers the positive control
phenotype in vitro but not in vivo. The contacts that
stabilize the Cro dimer are known, and several mutations in
the dimerization function have been identified (PAKU86).

By the methods disclosed herein, DBPs may be developed
from Cro which recognize DNA binding sites different from
the λ OR3 or λ operator consensus binding sites, including
heterodimeric DBPs which recognize non-symmetric DNA
binding sites.

Selections for phenotypes conferred by DBP+ function:

Media generally are supplemented with IPTG and
antibiotic for selection of plasmid maintenance. Cell
background is generally strain delta4 (galK,T,E deletion).

a. Galactose resistance (GalR). Galactose epimerase
deficient (galE-) strains of E. coli (BUTT63) lyse when
treated with galactose. Selective medium is supplemented
with 2% galactose, added after autoclaving. Additional
galactose, up to 8%, somewhat reduces the background of
artifactual galactose-sensitive colonies.

b. Galactose resistance selected immediately after
transformation. Inducer IPTG is added to transformed
cells, to 5 x 10^-4 M, at the start of the growth period
that allows expression of plasmid antibiotic resistance.
At 60 min after heat shock, cells are further diluted 10-
fold into fresh LB broth containing IPTG, antibiotic to
select for plasmid uptake (e.g. Ap or Km), and 2% galac-
tose. Cells are grown until lysis is complete or for 3 h,
whichever occurs first, then centrifuged at 6,000 rpm for
10 min, resuspended in the initial volume of the post-
transformation growth culture, and applied to medium for
further selection.

c. Fusaric acid resistance (TcS, FusR). Successful
repression of tet yields resistance to lipophilic chelating
agents such as fusaric acid (FusR phenotype). Medium
described by MALO81 is used for selection of fusaric acid
resistance in E. coli; the amount of fusaric acid may be
varied. Total cell inoculum is not greater than 5 x 10^
per plate.

d. Fusaric acid resistance and galactose resistance.
Galactose at a final concentration of 2% is added to the
medium described by MALO81 after autoclaving. Cells
selected directly for galactose resistance in liquid
following transformation are applied to this medium.

Selections for phenotypes conferred by DBP- function:

Cell background is generally strain HB101 (galK-).
Media are generally supplemented with IPTG and antibiotic
for plasmid maintenance.

a. Tc resistance. Medium, usually LB agar, is supple-
mented with Tc after autoclaving. Tc stock solution is
12.5 mg/ml in ethanol. It is stored at -20°C, wrapped in
aluminum foil. Petri plates containing Tc are also wrapped
in foil. Minimum inhibitory concentration is 3.1 ug/ml
using a cell inoculum of 5 x 10^ to 10^ per plate. More
stringent selections employ up to 50 ug/ml Tc. When used
for selection of plasmid maintenance, Tc concentration is
12.5 ug/ml.

b. Galactose utilization. Minimal A Medium (MILL72,
p432), with galactose as carbon source: after autoclaving
add (per liter) 1 ml of 1 M MgSO4, 0.5 ml of 10 mg/ml
thiamine HCl, 10 ml of 20% galactose, and amino acids as
required. Cell inoculum per plate is less than 5 x 10^.

c. Tc resistance and galactose utilization. Medium A with
galactose (section b. above) is supplemented with Tc at
3.1 ug/ml.


Selectable systems for DBP isolation:

The tet gene from pBR322 and the E. coli galT,K genes
are used in a gal deletion host strain for selection of DBP
function. pKK175-6 (BROS84; Pharmacia, Piscataway, NJ), a
pBR322 derivative, contains the replication origin, bla
(confers ApR) for selection of plasmid maintenance, and
tet, one of the two selectable genes (Figure 3). In
pKK175-6, tet is promoterless, and all DNA sequences
upstream of the pBR322 tet coding region that potentially
allow transcription in both directions (BROS82) have been
deleted and replaced by the M13 mp8 polylinker. The
polylinker and tet are flanked by strong transcription
terminators from E. coli rrnB. We place tet under control
of the Tn5 neo promoter, Pneo.

Plasmid pAA3H (Figure 4) (ATCC #37,308) (AHME84)
provides the second set of selectable genes, galT,K. In
gal deleted hosts (such as strain ATCC #35,882 carrying the
delta4 deletion (E. coli delta4)), plasmid pAA3H confers
the ApR Tc^ GalS phenotype (AHME84) because part of galE
is deleted. The galT and galK genes in pAA3H are tran-
scribed from the "antitet" promoter (BROS82). In E.
coli strains carrying galT or galK mutations (e.g. strain
HB101), pAA3H confers Gal+. We place galT+ and galK+ under
control of the pBR322 amp gene promoter.

For both tet and gal systems, positive selections are
used to select cells that either express or do not express
these genes from cultures containing a vast excess of cells
of the opposite phenotype.

Placement of test DNA binding sequence:

The test DNA binding sequence for the IDBP, λ OR3
(KIMJ87), is placed so that the first 5' base is the +1
base of the mRNA transcribed in each of the tet and gal
transcription units (Table 100 and Table 101).

Engineering the idbp gene:

A DNA sequence encoding the wild-type Cro protein is
designed such that expression is controlled by the lacUV5
promoter. The DNA sequence departs from the wild-type cro
gene sequence by the introduction of restriction sites.
Thus, the gene is called rav. The transcriptional unit
comprising PlacUV5, rav, and the trpA terminator is shown
in Table 102.

Vector construction:

The construction of an operative cloning vector is
summarized in Figure 5. The gal region of pAA3H requires
manipulation before insertion into pKK175-6. First the λ-
derived DNA between HpaI and EcoRI is replaced with a ClaI
linker (New England BioLabs, #1037). Standard methods are
used and the resulting plasmid is named pEP1001 (Figure 6).
All plasmids cited in the present application are cata-
logued in Table 103.



Next, we insert a synthetic fragment, shown in Table
104, comprising the phage fd terminator and two restriction
sites (SpeI and SfiI) into the ClaI site of pEP1001; the
resulting plasmid is named pEP1002 (Figure 7). Next, we
replace the promoter upstream of gal with Pamp from
pBR322. As shown in Table 100, λ OR3 is positioned
downstream of Pamp so that it can be used to determine
whether binding of Cro can prevent transcription of galT,K.
Restriction sites are provided to allow later alteration of
the target sequence. The synthetic fragment is cloned into
pEP1002 between DraIII and BamHI. The resulting plasmid is
named pEP1003 and confers GalS on delta4 cells.

The gal genes with the promoter and the fd terminator
are moved from pEP1003 into pKK175-6. The 2.69 kb galT,K-
bearing HpaI fragment of pEP1003 is ligated to DNA obtained
from pKK175-6 by partial DraI digestion. Gal+ colonies of
transformed HB101 cells are picked. The resulting plasmid
is named pEP1004 (Figure 9).

The Tn5 neo gene promoter and OR3 are synthesized
(Table 101) and inserted upstream of the tet coding region
of pEP1004 between the unique HindIII and SmaI sites.
Plasmid DNA from ApR TcR GalS colonies of transformed
delta4 cells is analyzed for an insert in the EcoRI-EcoRV
fragment of pEP1004. The resulting 7.1 kb plasmid, with
two separate selectable gene systems under control of two
different promoters and the test DNA binding sequence, is
designated pEP1005 (Figure 10).

Cloning the idbp gene:

The BamHI site in the tet gene of pEP1005 is removed
by site-directed mutagenesis; the sequence
TGG-ATC-CTC that codes for W97-I98-L99 is changed to TGG-
ATA-TTG. DNA from pEP1005 is linearized with EcoRV and
part (ca. 10%) of the DNA is made single stranded with
exonuclease III. The mutagenic oligonucleotide shown in
Table 105 is annealed to the DNA that is then completed
with Klenow enzyme and ligated. Plasmid DNA from TcR, Gal+
colonies of transformed HB101 is analyzed by standard
means; the resulting plasmid is named pEP1006.
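
The codon change above can be checked to be silent and to
destroy the BamHI recognition site (GGATCC). The sketch below
is illustrative only; it uses the standard genetic code, and
the helper names are not taken from the application.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame DNA string into one-letter amino acids."""
    dna = dna.upper().replace("-", "")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def is_silent_site_removal(old, new, site):
    """True if `new` encodes the same peptide as `old` and lacks `site`."""
    old = old.upper().replace("-", "")
    new = new.upper().replace("-", "")
    return (translate(old) == translate(new)
            and site in old
            and site not in new)
```

For the change described above, translate("TGG-ATC-CTC") and
translate("TGG-ATA-TTG") both give "WIL", and only the first
sequence contains GGATCC.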

Synthetic DNA containing a SpeI overhang followed by
sequences for the lacUV5 promoter, a ribosome binding site,
cloning sites for idbp, the trpA terminator (ROSE79), and
an SfiI restricted end complementary to the SfiI site in
pEP1006 is synthesized as six oligonucleotides as shown in
Table 107. We use the methods of THER88 to anneal and
ligate these fragments into SpeI-SfiI cut pEP1006.
Plasmid DNA from ApR, TcR, GalS colonies of transformed
delta4 cells is examined for the SpeI-SfiI insertion by
restriction with SpeI, BstEII, BglII, KpnI, and SfiI. The
inserted DNA is verified by DNA sequencing, and the 7.22 kb
plasmid containing the proper insertion is designated
pEP1007, shown in Figure 11.

The idbp gene sequence specifying the Cro+ protein,
designated rav in this Example, is inserted in two
cloning steps. The BstEII-BglII segment of rav (Table 109)
is inserted first. Oligonucleotides olig#14 and olig#15
are synthesized, annealed, and filled in with Klenow
enzyme (cf. KARN84). The dsDNA is cut with BstEII and BglII
and ligated to BstEII-BglII cut pEP1007. The plasmid
containing the appropriate partial rav sequence is desig-
nated pEP1008.

The BglII-KpnI fragment of rav is synthesized and
inserted in the same manner as the BstEII-BglII fragment.
(See Table 110.) This plasmid carrying the complete rav
gene is designated pEP1009, shown in Figure 12.

Determine whether IDBP is expressed: 

To determine whether cells carrying pEP1009 display
the phenotypes expected for rav expression, the delta4
strain bearing pEP1009 is tested on various Ap-containing
selective media with and without IPTG. Cells are streaked
on LB agar media containing: a) Tc; b) fusaric acid; or c)
galactose (vide supra). Control strains are the delta4
host with no plasmid, and with pEP1005, pBR322, or pAA3H.

The results below indicate that the rav gene is
expressed and the gene product is functional, and that
expression is regulated by the lacUV5 promoter.



Growth of derivatives of strain delta4
on selective media (+ Ap)

supplements:    tetracycline    fusaric acid    galactose
IPTG:           +    -
plasmid:
  pBR322
  pAA3H
  pEP1009
  pEP1005

λ cI- phage is streaked on each of the above strains,
on LB agar with Ap, and with and without IPTG. At suffi-
ciently high intracellular levels of Cro protein, binding
of the Cro repressor protein to the λ phage operators
OR and OL prevents phage growth. Data indicating correct
expression and function of the rav gene are:

Growth of λ cI- on delta4 cells



plasmid     phage growth
            +IPTG    -IPTG
+
pEP1009
pEP1005
+

These procedures indicate that the chosen IDBP, the
product of the rav gene, is expressed and is successfully
repressing both the test operators on the plasmid and the
wild type operators on the challenge phage.

DBP purification:

Proteins are purified as described by Leighton and Lu
(LEIG87).

Quantitation of DBP binding:

We measure DBP binding to the target operator DNA
sequence with a filter binding assay, initially using
filter binding assay conditions similar to those described
for λ Cro (KIMJ87). Data are analyzed by the methods of
RIGG70 and KIMJ87.

The target DNA for the assay is the 113 bp ApaI-RsaI
fragment from plasmid pEP1009 containing λ OR3. A control
DNA fragment of the same size, used to determine non-
specific DNA binding, contains a synthetic ApaI-XbaI DNA
fragment specifying the amp promoter and the sequence

     5' CTTATACACGAAGCGTGACAA 3'.

This sequence preserves the base content of the OR3
sequence but lacks several sites of conserved sequence
required for λ Cro binding (KIMJ87) and is cloned between
the ApaI-XbaI sites of the pEP1009 backbone to yield
pEP1010.

Media formulations:


GalS is demonstrable in LB agar and broth at very low
concentrations (0.2% galactose), and is optimal at 2 to 8%
galactose. Galactose and Tc selections are performed in LB
medium. FusR is best achieved in the medium described by
Maloy and Nunn (MALO81) for E. coli K-12 strains.

Induction of DBP expression:

The idbp gene is regulated by the lacUV5 promoter.
Optimal induction is achieved by addition of IPTG at 5 x
10^-4 M (MAUR80). Experimentation for each successful DBP
determines the lowest concentration that is sufficient to
maintain repression of the selection system genes.

Optimization of selections:

For each selective medium used to detect IDBP function,
factors are varied to obtain a maximal number of transform-
ants per plate with a minimal number of false positive
artifactual colonies. Of greatest importance in this
optimization is the transcriptional regulation of the
initial potential-DBP, such that in further mutagenesis
studies, de novo binding at an intermediate affinity is
compensated by high level production of DBP.



Regulation of IDBP:

Cells carrying pEP1009 are grown in LB broth with IPTG
at 10^-6, 5 x 10^-6, 10^-5, 5 x 10^-5, 10^-4, and 5 x 10^-4
M. Samples are plated on LB agar and on LB agar containing
fusaric acid or galactose as described above. All media
contain 200 ug/ml Ap, and the IPTG concentrations of the
broth culture media are maintained in the respective
selective agar media.



The IPTG concentration at which 50% of the cells
survive is a measure of affinity between IDBP and test
operator, such that the lower the concentration, the
greater the affinity. A requirement for low IPTG, e.g.
10*^ M, for 50% survival due to Rav protein function
suggests that use of a high level, e.g. 5 x 10^-4 M IPTG,
employed in selective media to isolate mutants displaying
de novo binding of a DBP to target DNA, will enable
isolation of successful DBPs even if the affinity is low.
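
The IPTG concentration giving 50% survival can be estimated
from the plating series by interpolation. The sketch below is
one possible way to do this and is an assumption, not a method
stated in this Example: it interpolates linearly in log10 of
the IPTG concentration between the two tested concentrations
that bracket 50% survival. Any survival fractions supplied to
it are the user's own measurements.

```python
import math


def iptg_for_half_survival(points, threshold=0.5):
    """Estimate the IPTG concentration at which survival crosses threshold.

    points: iterable of (iptg_molar, surviving_fraction) pairs.
    Returns the interpolated concentration, or None if the survival
    data never cross the threshold.
    """
    pts = sorted(points)
    for (c0, s0), (c1, s1) in zip(pts, pts[1:]):
        crosses = (s0 - threshold) * (s1 - threshold) <= 0
        if crosses and s0 != s1:
            # Linear interpolation on a log10 concentration axis.
            frac = (threshold - s0) / (s1 - s0)
            log_c = math.log10(c0) + frac * (math.log10(c1) - math.log10(c0))
            return 10 ** log_c
    return None
```

A lower returned concentration corresponds, per the reasoning
above, to a greater affinity between the IDBP and the test
operator.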

Concentration of selective agents and cell inoculum size: 

Fusaric acid and galactose content of each medium is
varied, to allow the largest possible cell sample to be
applied per Petri plate. This objective is obtained by
applying samples of large numbers of sensitive cells (e.g.
5 x 10^, 10^, 5 x 10^) to plates with elevated fusaric acid
or galactose. Resistant cells are then used to determine
the efficiency of plating. An acceptable efficiency is 80%
viability for the resistant control strain bearing pEP1009
in a delta4 background. The total cell inoculum size is
increased as is the level of inhibitory compound until
viability is reduced to less than 80%.

Choice and cloning of target sequences: 

Sequences of the human immunodeficiency virus type 1
(HIV-1) genome were searched for potential target sequen-
ces. The known sequences of isolates of HIV-1 were
obtained from the GENBANK version 52.0 DNA sequence data
base. First we found non-variable regions of HIV-1. We
examined the HIV-1 genome from the TATA sequence in the
5' LTR of the HIV-1 genome to the end of the sequence coding
for the tat and trs second exons. We intended to locate
non-variable regions where a DBP can interfere with the
production of tat and/or trs mRNA because the products of
these genes are essential in production of virus (DAYT86,
FEIN86).

HIV-1 isolate HXB2 (RATN85) from nucleotide number 1
through 6100 is the reference to which we aligned all other
HIV-1 isolates using the Nucleic Acid Database Search
program (derived from FASTN (LIPM85)) in the IBI/Pustell
Sequence Analysis Programs software package (International
Biotechnologies, Inc., New Haven, CT). All stretches of at
least 20 bases which have no variation in sequence among
all HIV-1 isolates were retained as targets.

From the alignment, segments of the HIV-1 isolate HXB2
sequence that are non-variable among all HIV-1 sequences
searched are:

      350 -  371,     519 -  545,     623 -  651,     679 -  697,
      759 -  781,     783 -  805,    1016 - 1051,    1323 - 1342,
     1494 - 1519,    1591 - 1612,    1725 - 1751,    1816 - 1837,
     2067 - 2094,    2139 - 2164,    2387 - 2427,    2567 - 2606,
     2615 - 2650,    2996 - 3018,    3092 - 3117,    3500 - 3523,
     3866 - 3887,    4149 - 4170,    4172 - 4206,    4280 - 4302,
     4370 - 4404,    4533 - 4561,    4661 - 4695,    4742 - 4767,
     4808 - 4828,    4838 - 4864,    4882 - 4911,    4952 - 4983,
     5030 - 5074,    5151 - 5173,    5553 - 5573,    5955 - 5991
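
The retention rule stated above (stretches of at least 20
bases with no variation among all isolates) can be sketched as
follows. This is an illustration, not the IBI/Pustell program:
it assumes the isolates are already aligned to equal length
with "-" for gaps, treats any column containing a gap in the
reference as variable, and reports 1-based inclusive alignment
columns. The function name is hypothetical.

```python
def invariant_segments(aligned, min_len=20):
    """Return (start, end) column ranges where all sequences agree.

    aligned: list of equal-length aligned sequences; aligned[0] is
    the reference. Ranges are 1-based and inclusive.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have equal length")
    segments = []
    start = None
    # Iterate one column past the end so a run reaching the last
    # column is closed and reported.
    for i in range(length + 1):
        same = (i < length
                and aligned[0][i] != "-"
                and all(s[i] == aligned[0][i] for s in aligned))
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_len:
                segments.append((start + 1, i))
            start = None
    return segments
```

With min_len=20 this reproduces the stated threshold; shorter
values are useful only for testing on small inputs.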



In the present Example, these potential regions were
searched for subsequences matching the central seven base
pairs of the λ operators that have high affinity for λ Cro
(viz. OR3, the symmetric consensus, and the Kim et al.
consensus (KIMJ87)). The consensus sequence of Kim et al.
has higher affinity for Cro than does OR3, which is the
natural λ operator having highest affinity for Cro. Cro is
thought to recognize seventeen base pairs, with side
groups on alpha 3 directly contacting the outer four or
five bases on each end of the operator. Because the
composition and sequence of the inner seven base pairs
affect the position and flexibility of the outer five base
pairs to either side, these bases affect the affinity of
Cro for the operator.

The sequences sought are shown in Table 111. The
letters "A" and "S" stand for antisense and sense.
"OR3A/Symm. Consensus.5" is a composite that has OR3A at
all locations except 5, where it has the symmetric consen-
sus base, C. Similarly, "OR3A/Symm. Consensus.6" has the
symmetric consensus base at location 6 and OR3A at other
locations.

A FORTRAN program searched the non-variable HIV-1
subsequence segments for stretches of seven nucleotides of
which at least five are G or C and which are flanked on
either side by five bases of non-variable HIV-1 sub-
sequence. The 427 candidate seven-base-pair subsequences
obtained using these constraints on CG content were then
searched for matches to either the sense or anti-sense
strand sequences of the five seven-base-pair subsequences
listed above. None of the HIV-1 subsequences is identical
to any of the seven-base-pair subsequences. Three HIV-1
subsequences, shown in Table 112, were found that match six
of seven bases. Eight subsequences, shown in Table 113,
were found that match five out of seven bases and that have
five or more GC base pairs. These HIV-1 subsequences are
less preferred than the HIV-1 subsequences that match six
out of seven bases.
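
The search just described can be sketched in a few lines. This
is not the FORTRAN program referred to above; it is a minimal
illustration under stated assumptions: each non-variable
segment is given as a plain string, query seven-mers are
compared on both strands, and the score is the number of
identical positions. The function names are hypothetical, and
the query list passed in is supplied by the user (Table 111 is
not reproduced here).

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq.upper()))


def candidate_cores(segment, core_len=7, flank=5, min_gc=5):
    """List (offset, core) pairs with enough G/C and full flanks.

    offset is the 0-based start of the core within the segment; the
    core must have `flank` bases of the segment on either side.
    """
    seg = segment.upper()
    found = []
    for i in range(flank, len(seg) - core_len - flank + 1):
        core = seg[i:i + core_len]
        if sum(b in "GC" for b in core) >= min_gc:
            found.append((i, core))
    return found


def best_match(core, queries):
    """Return (score, query) for the best-matching query on either strand."""
    best = (0, None)
    for q in queries:
        for strand in (q.upper(), reverse_complement(q)):
            score = sum(a == b for a, b in zip(core.upper(), strand))
            if score > best[0]:
                best = (score, q)
    return best
```

As an example, the central seven bases of Target HIV 353-369
(CCGCTGG) score six of seven against the central seven bases
of the symmetric consensus as printed in this section
(CCGCCGG).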

           |
            11111111
   12345678901234567
5' aCtTTccGCTggGGaCt    Bases 353-369
   actttccGCTggaaagt    Left symmetrized
   agtccccGCTggggact    Right symmetrized
   tatcAcCGCAAgGgata    OR3

(Lower case letters are palindromic in the two halves of
the targets and OR3; highly conserved bases are bold and
marked thus a.)

Among the outer five bases of each half operator, bases 1
and 3 are palindromically related to bases 17 and 15 in
Target HIV 353-369.

           |
   TCTCGAcGCAgGACTCG    Bases 681-697
   tctcgAcGCAgGcgaga    Left symmetrized
   cgagtAcGCAgGactcg    Right symmetrized
   tatcAcCGCAAgGgata    OR3

None of bases 1-5 are palindromically related to bases 13-
17 in Target HIV 681-697.

           |
   TTTGAcTAGCGgAGGCT    Bases 760-776
   tttgacTAGCGgtcaaa    Left symmetrized
   agcctcTAGCGgaggct    Right symmetrized
   tatcAcCGCAAgGgata    OR3

None of bases 1-5 are palindromically related to bases 13-
17 in Target HIV 760-776.
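
The left- and right-symmetrized sequences listed above are
consistent with keeping the central seven bases of the
seventeen-base target and replacing one outer five-base arm
with the reverse complement of the other. The sketch below
reproduces that construction; the function names are
illustrative, and the arm length of five is inferred from the
sequences shown rather than stated as a rule.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq.upper()))


def symmetrize(target, arm=5):
    """Return (left_symmetrized, right_symmetrized) versions of target.

    The central bases are kept; in the left-symmetrized sequence the
    right arm is replaced by the reverse complement of the left arm,
    and vice versa for the right-symmetrized sequence.
    """
    t = target.upper()
    left, core, right = t[:arm], t[arm:len(t) - arm], t[-arm:]
    left_sym = left + core + reverse_complement(left)
    right_sym = reverse_complement(right) + core + right
    return left_sym, right_sym
```

Applied to bases 353-369 (ACTTTCCGCTGGGGACT), this yields
ACTTTCCGCTGGAAAGT and AGTCCCCGCTGGGGACT, matching the
left- and right-symmetrized sequences given for that target.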

There is extensive sequence variability among the
twelve phage λ operator half-sites. For example:

           |
   tAtCaCCGCCGGtGaTa    Consensus
   tAtCaCCGCaaGgGaTa    OR3A

The bases in lower case in the Consensus and OR3 sequences
shown above are more variable among various lambdoid
operators than are bases shown by upper case letters.
Studies of mutant operators indicate that A2 and C4 are
required for Cro binding. In Target HIV 353-369, bases
T3, C6, C7, G8, C9, G14, and A15 match the symmetric
consensus sequence, but the highly conserved A2 and C4 are
different from lambdoid operators and Cro will not bind to
these subsequences. Mutagenesis of the DNA-contacting
residues of alpha 3 is thus the first step in producing a
DBP that recognizes the left symmetrized or right sym-
metrized target sequences.

Target HIV 353-369 is a preferred target because the
core (underlined above) is highly similar to the Kim et al.
consensus. Target HIV 760-776 is preferred over Target HIV
681-697 because it is highly similar to OR3.

The method of the present invention does not require
any similarity between the target subsequence and the
original binding site of the initial DBP. The fortuitous
existence of one or more subsequences within the target
genes that have similarity to the original binding site of
the initial DBP reduces the number of iterative steps
needed to obtain a protein having high affinity and
specificity for binding to a site in the target gene.

Since the target sequence is from a pathogenic
organism, we require that the chosen target subsequence be
absent or rare in the genome of the host organism, e.g. the
target subsequences chosen from HIV should be absent or
rare in the human genome.

Candidate target binding sites are initially screened
for their frequency in primate genomes by searching all DNA
sequences in the GENBANK Primate directory (2,258,436
nucleotides) using the IBI/Pustell Nucleic Acid Database
Search program to locate exact or close matches. A similar
search is made of the E. coli sequences in the GENBANK
Bacterial directory and in the sequence of the plasmid
containing the idbp gene. The sequences of potential sites
for which no matches are found are used to make oligonucle-
otide probes for Southern analysis of human genomic DNA
(SOUT75). Sequences which do not specifically bind human
DNA are retained as target binding sequences.

The HIV 353-369 left symmetrized and right symmetrized
target subsequences are inserted upstream of the selectable
genes in the plasmid pEP1005, replacing the test sequences,
to produce two operative cloning vectors, pEP1011 and
pEP1012, for development of RavL and RavR DBPs. The
promoter-test sequence cassettes upstream of the tet and
gal operon genes are excised using StuI-HindIII and ApaI-
XbaI restrictions, respectively. Replacement promoter-
target sequence cassettes are synthesized and inserted into
the vector, replacing OR3 with the HIV 353-369 left or
right symmetrized target sequence in the sequences shown in
Table 100 and Table 101.



Choice of residues in Cro to vary: 

15 

The choice of the principal and secondary sets of 
residues depends on the goal of the mutagenesis. In the 
protocol described here we vary, in separate procedures,
the residues: a) involved in DNA recognition by the
protein, and b) involved in dimerization of the protein.
In this section we identify principal and secondary sets of 
residues for DNA recognition and dimerization. 

Pick principal set for DNA recognition:

The principal set of residues involved in DNA recognition
is defined as those residues which contact the
operator DNA in the sequence-specific DNA-protein complex.
Although no crystal structure of a λ Cro-operator DNA
complex is available, a crystal structure of a complex
between the structural homolog 434 repressor N-terminal
domain and a consensus operator has been described
(ANDE87). A crystal structure of Cro dimer has been
determined (ANDE81) and modeling studies have suggested
residues that can make sequence-specific or sequence-
independent contacts with DNA in sequence-specific
complexes (TAKE83, OHLE83, TAKE85, TAKE86). Isolation and
characterization of Cro mutants have identified residues
which contact DNA in protein-operator complexes (PAKU86,
HOCH86a,b, EISE85).

Important contacts with DNA are made by protein
residues in and around the H-T-H region and in the C-
terminal region. Hochschild et al. (HOCH86a,b) have
presented direct evidence that Cro alpha helix 3 residues
S28, N31, and K32 make sequence-specific contacts with
operator bases in the major groove. Mutagenesis
experiments (EISE85, PAKU86) and modeling studies (TAKE85)
have implicated these residues as well. In addition, these
studies suggest that H-T-H region residues Q16, K21, Y26,
Q27, H35, A36, R38, and K39 also make contacts with
operator DNA. In the C-terminal region, mutagenesis
experiments (PAKU86) and chemical modification studies
(TAKE86) have identified K56 and K62 as making contacts to
DNA. In addition, computer modeling suggests that the 5 to
6 C-terminal amino acids of λ Cro can contact the DNA
along the minor groove (TAKE85). From these
considerations, we select the following set of residues as a
principal set for use in variegation steps intended to
modify DNA recognition by Cro or mutant derivative
proteins: 16, 21, 26, 27, 31, 32, 35, 36, 38, 39, 56, 62, 63,
64, 65, 66.



Pick secondary set for DNA recognition:

The residues in the secondary set contact or otherwise
influence residues in the principal set. A secondary set
for DNA recognition includes the buried residues of alpha
helix 3: A29, I30, A33, and I34. Interactions between
buried residues in alpha helix 2 and buried residues in
alpha helix 3 are known to stabilize H-T-H structure, and
residues in the turn between alpha helix 2 and alpha helix
3 of H-T-H proteins are conserved among these proteins
(PTAS86, p. 102). In λ Cro these positions are T17, T19,
A20, L23, G24, and V25. Changes in the dimerization
region can influence binding. In λ Cro, residues thought
to be involved in dimer stabilization are E54, V55, and F58
(TAKE85, PABO84). Finally, residues influencing the
position of the C-terminal arm of λ Cro are P57, P59, and
S60. Thus the secondary set of residues for use in
variegation steps intended to modify DNA recognition by λ Cro or
Rav proteins is: 17, 19, 20, 23, 24, 29, 30, 33, 34, 54,
55, 57, 58, 59 and 60.

Pick principal set for dimerization:



Different principal and secondary sets of residues
must be picked for use in variegation steps intended to
alter dimer interactions. In λ Cro, antiparallel
interactions between E54, V55, and K56 on each monomer have been
proposed to stabilize the dimer (PABO84). In addition, F58
from one monomer has been suggested to contact residues in
the hydrophobic core of the second monomer. Inspection of
the 3D structure of λ Cro suggests important contacts are
made between F58 of one monomer and I40, A33, L23, V25,
E54, and A52. In addition, residues L7, I30, and L42 of
one monomer could make contact with a large side chain
positioned at 58 in the other monomer. Thus, a set of
principal residues includes: 7, 23, 25, 30, 33, 40, 42, 52,
54, 55, 56, and 58.



Pick secondary set for dimerization: 



The secondary set of residues for variegation steps
used to alter dimer interactions includes residues in or
near the antiparallel beta sheet that contains the
dimer-forming residues. Residues in this region are E53, P57,
and P59. Residues in alpha helix 1 influencing the
orientation of principal set residues are K8, A11, and M12.
Residues in the antiparallel beta sheet formed by the beta
strands 1, 2, and 3 (see Table 1) in each monomer also
influence residues in the principal set. These residues
include I5, T6, K39, F41, V50, and Y51. Thus the set of
secondary residues includes: 5, 6, 8, 11, 12, 41, 50, 51,
53, 57, and 59.

Pick the range of variation for alteration of DNA binding:

For the initial variegation step to produce a modified
Rav protein with altered DNA specificity, a set of 5
residues from the principal set is picked. Focused
Mutagenesis is used to vary all five residues through all
twenty amino acids. The residues are picked from the
same interaction set so that as many as 3.2 x 10^6 different
DNA binding surfaces will be produced.

A number of studies have shown that the residues in
the N-terminal half of the recognition helix of an H-T-H
protein strongly influence the sequence specificity and
strength of protein binding to DNA (HOCH86a,b, WHAR85,
PABO84). For this reason we choose residues Y26, Q27, S28,
N31, and K32 from the principal set as residues to vary in
the first variegation step. Using the optimized nucleotide
distribution for Focused Mutagenesis described above, and
assuming the value of S_err defined at the start of this
Example, the parental sequence is present in the variegated
mixture at one part in 3.1 x 10^ and the least favored
sequence, F at each residue, is present at one part in 10^.
Thus, this level of variegation is well within bounds for a
synthesis, ligation, transformation, and selection system
capable of examining 5 x 10^ DNA sequences.
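The abundance of a given protein sequence in the variegated mixture follows from the per-position nucleotide mixtures. The Python sketch below uses the "f", "z", and "k" mixtures given for the fzk codon in this Example and the standard genetic code. It deliberately ignores the synthesis error term S_err, so its figures are only rough estimates and need not equal the values stated in the text; all function and variable names are illustrative assumptions.

```python
from itertools import product
from math import prod

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
GENETIC_CODE = {"".join(c): aa
                for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Nucleotide mixtures for codon positions 1, 2, and 3 of an "fzk" codon.
F_MIX = {"T": 0.26, "C": 0.18, "A": 0.26, "G": 0.30}
Z_MIX = {"T": 0.22, "C": 0.16, "A": 0.40, "G": 0.22}
K_MIX = {"T": 0.50, "G": 0.50}


def residue_frequencies():
    """Probability that one fzk codon encodes each amino acid ('*' = stop)."""
    freqs = {}
    for (b1, p1), (b2, p2), (b3, p3) in product(
            F_MIX.items(), Z_MIX.items(), K_MIX.items()):
        aa = GENETIC_CODE[b1 + b2 + b3]
        freqs[aa] = freqs.get(aa, 0.0) + p1 * p2 * p3
    return freqs


def sequence_frequency(residues, freqs):
    """Probability of a specific amino acid at every variegated position."""
    return prod(freqs[aa] for aa in residues)


FREQS = residue_frequencies()
parental = sequence_frequency("YQSNK", FREQS)  # Y26, Q27, S28, N31, K32
all_phe = sequence_frequency("FFFFF", FREQS)   # F at each varied residue

print(f"parental sequence: about one part in {1 / parental:.2e}")
print(f"F at each residue: about one part in {1 / all_phe:.2e}")
```

Comparing the reciprocal of the smallest frequency with the number of independent transformants that can be screened gives a quick check that every intended sequence is likely to be sampled.
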

Pick the range of variation of residues for alteration of
dimerization:

As described in the Detailed Description and in this
Example, altered λ Cro proteins, RavL and RavR, that bind
specifically and tightly to left and right symmetrized
targets derived from HIV 353-369, are first developed
through one or more variegation steps. Site-specific
changes are then engineered into ravL to produce
dimerization-defective proteins. Structure-directed Mutagenesis is
performed on ravR to produce mutations in RavR that can
complement dimerization-defective RavL proteins and produce
obligate heterodimers that bind to HIV 353-369.


One of the interactions in the dimerization region of
λ Cro is the hydrophobic contact between residues V55 of
both monomers. The VF55 mutation substitutes a bulky
hydrophobic side group in place of the smaller hydrophobic
residue; other substitutions at residue 55 can be made and
tested for their ability to dimerize. A small hydrophobic
or neutral residue present at residue 55 in a protein
encoded on expression by a second gene may result in
obligate complementation of VF55. In addition, changes in
nearby components of the beta strand, E53, E54, K56, and
P57 may effect complementation. Thus a set of residues for
the initial variegation step to alter the RavR dimer
recognition is 53, 54, 55, 56, and 57.



Another interaction in the dimerization region of λ
Cro is the hydrophobic contact between F58 of one monomer
with the hydrophobic core of the other monomer. As
mentioned above, residues L7, L23, V25, A33, I40, L42, A52,
and E54 of one monomer all could make contacts with a
large residue at position 58 in the other monomer. The
FW58 mutation inserts the largest aromatic amino acid at
this position. Compensation for this substitution may
require several changes in the hydrophobic core of the
complementing monomer. Residues for Focused Mutagenesis in
the initial variegation step to alter RavR dimer
recognition in this case are: 23, 25, 33, 40, and 42.



In each of the two cases described above, the initial
variegation step involves Focused Mutagenesis to alter 5
residues through all twenty amino acids. As was shown in
Section 6.2.5, this level of variegation is within the
limits set by using optimized codon distributions and the
values for S_err and transformation yield assumed at the
start of this Example.

Mutagenesis of DNA: 



Codons encoding λ Cro residues Y26, Q27, S28, N31, and
K32 are contained in a 51 bp PpuMI to BglII fragment of the
rav gene. To produce the cassette containing the
variegated codons we synthesize the 66 nucleotide antisense
variegated strand, olig#50, and the primer, olig#52:

              d   l   g   v   X   X   X   a   i   X
             22  23  24  25  26  27  28  29  30  31
5' t cct aAG GAC CTA GGG GTG fzk fzk fzk GCG ATT fzk
          | PpuMI |

    X   a   i   h   a   g   r   k   i
   32  33  34  35  36  37  38  39  40
   fzk GCC ATC CAT GCC GGC CGA AAG ATC Tt 3'   olig#50
                                |BglII |

                    3'-ccg gct ttc tag aacgccgtg-5'   olig#52

The position of the amino acid residue in λ Cro is shown
above the codon for the residue. Unaltered residues are
indicated by their lower case single letter amino acid
codes shown above the position number. Variegated residues
are denoted with an upper case, bold X. The restriction
sites for PpuMI and BglII are indicated below the sequence.
Since restriction enzymes do not cut well at the ends of
DNA fragments, 5 extra nucleotides have been added to the
5' end of the cassette. These extra nucleotides are shown
in lower case letters and are removed prior to ligating the
cassette into the operative vector. The sequence "fzk"
denotes the variegated codons and indicates that nucleotide
mixtures optimized for codon positions 1, 2, or 3 are to be
used. "f" is a mixture of 26% T, 18% C, 26% A, and 30% G,
producing four possibilities. "z" is a mixture of 22% T,
16% C, 40% A, and 22% G, producing four possibilities. "k"
is an equimolar mixture of T and G, producing two
possibilities. Each "fzk" codon produces 4 x 4 x 2 = 2^5 = 32
possible DNA sequences, coding on expression for 20
possible amino acids and stop. The DNA segment above
comprises (2^5)^5 = 2^25 = 3.4 x 10^7 different DNA sequences
coding on expression for 20^5 = 3.2 x 10^6 different protein
sequences.
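The codon arithmetic above can be checked by enumerating every codon an fzk position can produce. The following Python sketch assumes only the standard genetic code and the rule that position 3 is restricted to T or G; the variable names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
GENETIC_CODE = {"".join(c): aa
                for c, aa in zip(product(BASES, repeat=3), AMINO)}

# An fzk codon: any base at positions 1 and 2, only T or G at position 3.
FZK_CODONS = ["".join(c) for c in product("TCAG", "TCAG", "TG")]

encoded = {GENETIC_CODE[codon] for codon in FZK_CODONS}  # '*' marks stop
stop_codons = [c for c in FZK_CODONS if GENETIC_CODE[c] == "*"]

n_codons = len(FZK_CODONS)                   # 4 x 4 x 2
n_amino_acids = len(encoded - {"*"})         # amino acids, excluding stop
dna_diversity = n_codons ** 5                # five variegated codons
protein_diversity = n_amino_acids ** 5

print(n_codons, n_amino_acids, stop_codons)
print(dna_diversity, protein_diversity)
```

Restricting the third position to T or G keeps all twenty amino acids available while leaving a single stop codon among the 32 possibilities.
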



After synthesis and purification of the variegated
DNA, the oligonucleotides #50 and #52 are annealed and the
resulting superoverhang is filled in using Klenow fragment
as described by Hill (AUSU87, Unit 8.2). The double
stranded oligonucleotide is digested with the enzymes PpuMI
and BglII and the mutagenic cassette is purified as
described by Hill. The mutagenic cassette is cloned into
the vectors pEP1011 and pEP1012 which have been digested
with PpuMI and BamHI, and the ligation mixtures containing
variegated DNA are used to transform competent delta4
cells. The transformed cells are selected for vector
uptake and for successful repression at low stringency as
described above. Cells containing Rav proteins that bind
to the left or right symmetrized targets display the TcS,
FusR and GalR phenotypes.


Surviving colonies are screened for correct DBP+ and
DBP- phenotypes in the presence or absence of IPTG as
described above. Relative measures of the strengths of
DBP-DNA interactions in vivo are obtained by comparing
phenotypes exhibited at reduced levels of IPTG. DBP genes
from clones exhibiting the desirable phenotypes are
sequenced. Plasmid numbers from pEP1100 to pEP1199 are
reserved for plasmids yielding ravL genes encoding proteins
that bind to the Left Symmetrized Targets carried on the
plasmids. Similarly, plasmid numbers pEP1200 through
pEP1299 are reserved for plasmids containing ravR genes
encoding proteins that bind to the Right Symmetrized
Targets carried on these plasmids.

Based on the determinations above, one or more RavL
and RavR proteins are chosen for further analysis in vitro.
Proteins are purified as described above. Purified DBPs
are quantitated and characterized by absorption
spectroscopy and polyacrylamide gel electrophoresis.

In vitro measurements of protein-DNA binding using
purified DBPs are performed as described in the Overview:
DNA-Binding, Protein Purification, and Characterization and
in this Example. These measurements determine equilibrium
binding constants (K_D), and the dissociation (k_d) and
association (k_a) rate constants for sequence-specific and
sequence-independent DBP-DNA complexes. In addition, DNase
protection assays are used to demonstrate specific DBP
binding to the Target sequences.

Estimates of relative DBP stability are obtained from
measurements of the thermal denaturation properties of the
proteins. In vitro measures of protein thermal stability
are obtained from determinations of protein circular
dichroism and resistance to proteolysis by thermolysin at
various temperatures (HECH84) or by differential scanning
calorimetry (HECH85b).


One or more iterations of variegation, involving
residues thought capable of influencing DNA binding, of
the ravL and ravR genes produce RavL and RavR proteins
that bind tightly and specifically to the HIV 353-369 left
and right symmetrized targets. Additional variegation
steps to optimize protein binding properties can be
performed as outlined in the Overview: Variegation
Strategy.

By hypothesis, we isolate pEP1127 that contains a
pdbp gene that codes on expression for RavL-27, shown in
Table 114, that binds the left-symmetrized target best
among selected RavL proteins. Similarly, pEP1238 contains
a pdbp gene that codes on expression for RavR-38, shown in
Table 115, that binds the right-symmetrized target best
among selected RavR proteins.

We now use the genes for the RavL and RavR monomers as
starting points for production of obligately heterodimeric
proteins RavL:RavR that recognize the HIV 353-369 target.
First we change the target sequences in pEP1238 (containing
ravR-38). We replace both occurrences of the Right
Symmetrized Target (in tet and galT,K promoters) with the
HIV 353-369 target sequence. Delta4 cells containing
plasmids carrying the HIV 353-369 targets display the ApR,
TcR, FusS and GalS phenotypes. Plasmids carrying HIV 353-
369 targets and the ravR gene are designated by numbers
pEP1400 through pEP1499, corresponding to the number of
the donor plasmid of the 1200 series; for example,
replacing the target sequences in pEP1238 produces pEP1438.
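The numbering convention for these target-swapped plasmids is a fixed offset from the donor number. A minimal Python sketch of that bookkeeping follows; the function name is hypothetical and the check on the 1200 series simply mirrors the ranges stated above.

```python
import re


def hiv_target_plasmid(donor):
    """Map a 1200-series donor plasmid name to its 1400-series derivative."""
    match = re.fullmatch(r"pEP(12\d\d)", donor)
    if match is None:
        raise ValueError(f"{donor!r} is not in the pEP1200-pEP1299 series")
    return f"pEP{int(match.group(1)) + 200}"


print(hiv_target_plasmid("pEP1238"))
```
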






Engineering dimerization mutants of RavL:

To create the site specific VF55 and FW58 mutations in
ravL, we synthesize the two mutagenesis primers:

     a   e   e   f   k   p   f
    52  53  54  55  56  57  58
5'  GGC GAA GAG TTC AAG CCC TTC  3'   VF55 primer
                ---

     v   k   p   w   p   s   n
    55  56  57  58  59  60  61
5'  GTA AAG CCC TGG CCC AGT AAC  3'   FW58 primer
                ---

Underlining indicates the varied codons and residues. The
plasmid pEP1127 (containing ravL-27) is chosen for
mutagenesis. The gene fragment coding on expression for the
carboxy-terminal region of the RavL protein is transferred
into M13mp18 as a BamHI to KpnI fragment. Oligonucleotide-
directed mutagenesis is performed as described by Kunkel
(AUSU87, Unit 8.1). The fragment bearing the modified
region of RavL is removed from M13 RF DNA as the BamHI to
KpnI fragment and ligated into the correct location in the
pEP1100 vector. Mutant-bearing plasmids are used to
transform competent cells. Transformed cells are selected
for plasmid uptake and screened for DBP- phenotypes (TcR,
FusS and GalS in E. coli delta4; Gal+ in E. coli HB101).
Plasmids isolated from DBP- cells are screened by
restriction analysis for the presence of the ravL gene and the
site-specific mutation is confirmed by sequencing. The
plasmid containing the ravL-27 gene with the VF55 mutation
is designated pEP1301. Plasmid pEP1302 contains the ravL-
27 gene with the FW58 alteration.


For the production of obligate heterodimers as
described below, the ravL- genes encoding the VF55 or FW58
mutations are excised from pEP1301 or pEP1302 and are
transferred into plasmids containing the gene for Km and
neomycin resistance (neo, also known as nptII). These
constructions are performed in three steps as outlined
below. First, the neo gene from Tn5 coding for KmR and
contained on a 1.3 kbp HindIII to SmaI DNA fragment is
ligated into the plasmid pSP64 (Promega, Madison, WI)
which has been digested with both HindIII and SmaI. The
resulting 4.3 kbp plasmid, pEP1303, confers both Ap and Km
resistance on host cells. Next, the bla gene is removed
from pEP1303 by digesting the plasmid with AatII and BglI.
The 3.5 kbp fragment resulting from this digest is
purified, the 3' overhanging ends are blunted using T4 DNA
polymerase (AUSU87, Unit 3.5), and the fragment is
recircularized. This plasmid is designated pEP1304 and
transforms cells to Km resistance. In the final step, the ravL-
gene is incorporated into pEP1304. Plasmid pEP1301 or
pEP1302 is digested with SfiI and the resulting 3'
overhangs are blunted using T4 DNA polymerase. Next the
linearized plasmid is digested with SpeI and the resulting
5' overhangs are blunted using the Klenow enzyme reaction
(KLEN70). The ca. 340 bp blunt-ended DNA fragment
containing the entire ravL- gene is purified and ligated into the
PvuII site in pEP1304. Transformed cells are selected for
KmR and screened by restriction digest analysis for the
presence of ravL- genes. The presence of ravL- genes
containing the site-specific VF55 or FW58 mutations is
confirmed by sequencing. The plasmid containing the ravL-
gene with the VF55 mutation is designated pEP1305. The
plasmid containing the ravL- gene with the FW58 mutation
is designated pEP1306.

In a manner similar to the constructions described
above, we ligate the original unmodified ravL gene into
pEP1304 to produce plasmid pEP1307.



Engineering heterodimer binding of target DNA:

This round of variegation is performed to produce
mutations in RavR proteins that complement the
dimerization-deficient mutations in the RavL proteins produced
above. To complement the FW58 mutation, the set of five
residues L23, V25, A33, I40, and L42 are chosen from the
principal set of residues as targets for Focused Mutagenesis.

In an initial series of procedures to test for
recognition of HIV 353-369 by the heterodimer RavL:RavR, we
transform cells containing pEP1438 (containing ravR-38 and
HIV 353-369 targets) with pEP1307 (containing ravL).
Intracellular expression of ravL and ravR produces a
population of dimeric repressors: RavL:RavL, RavL:RavR and
RavR:RavR. If the heterodimeric protein is formed and
binds to HIV 353-369, cells expressing both rav alleles
will exhibit the KmR ApR GalR FusR phenotypes (vide
infra). Several pairs of ravL and ravR genes are used in
parallel procedures; the best pair is picked for use and
further study. Selections for binding the HIV 353-369
target by the heterodimeric protein can be optimized using
this system.




Focused Mutagenesis of residues 23, 25, 33, 40, and 42
requires the synthesis and annealing of two overlapping
variegated strands because in the rav gene a single
cassette spanning these residues extends from the BalI site
to the BamHI site and exceeds the assumed synthesis limit
of 100 nucleotides. As no variegation affects the overlap,
the annealing region is complementary. The antisense
strand of the DNA sequence from the BalI site blunt end to
the end of the codon for G37 is denoted olig#53.

       q   t   k   t   a   k   d   X   g   X   y   q
      16  17  18  19  20  21  22  23  24  25  26  27
5'  C CAA ACC AAG ACA GCG AAG GAC fzk GGG fzk TAT CAG
    | BalI

    s   a   i   n   k   X   i   h   a   g
   28  29  30  31  32  33  34  35  36  37
   AGC GCG ATT AAC AAG fzk ATC CAT GCC GGC  3'   olig#53

f = (26% T, 18% C, 26% A, 30% G)
z = (22% T, 16% C, 40% A, 22% G)
k = equimolar T and G

Olig#53 contains vg codons for residues 23, 25, and 33.

Olig#54 is the sense strand from base 1 in codon 34
to the BamHI site:

        i   h   a   g   r   k   X
       34  35  36  37  38  39  40
3'     TAG GTA CGG CCG GCA TTC jqm

        f   X   t   i   n   a   d   g   s
       41  42  43  44  45  46  47  48  49
       AAG jqm TGG TAA TTG CGA CTA CCT AGG cca ca  5'   olig#54
                                   |BamHI|

j = (26% A, 18% G, 26% T, 30% C)
q = (22% A, 16% G, 40% T, 22% C)
m = equimolar A and C


Olig#54 contains variegated codons for residues 40 and 42.
Since olig#54 is the sense strand, the variegated
nucleotide distributions must complement the distributions for
codon positions 1, 2, and 3 used in the antisense strand.
These sense codon distributions are designated "j", "q",
and "m", and represent the complements to the optimized
codon distributions developed for codon positions 1, 2, and
3, respectively, in the antisense strand. The two strands
(olig#53 and olig#54) share a 12 nucleotide overlap
extending from the first position in the codon for I34 to the end
of the codon for G37. The overlap region is 66% G or C.
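The relationship between the two sets of nucleotide mixtures, and the G+C content of the overlap, can be verified with a few lines of Python. This is only a sketch: the helper names are assumptions, and the overlap string is the run of codons 34 to 37 (ATC CAT GCC GGC) taken from olig#53.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_mix(mix):
    """Base mixture needed on the opposite strand to pair with `mix`."""
    return {COMPLEMENT[base]: share for base, share in mix.items()}


def gc_fraction(seq):
    """Fraction of G or C bases in a DNA string."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


# Mixtures for codon positions 1, 2, and 3 on the olig#53 strand.
f = {"T": 0.26, "C": 0.18, "A": 0.26, "G": 0.30}
z = {"T": 0.22, "C": 0.16, "A": 0.40, "G": 0.22}
k = {"T": 0.50, "G": 0.50}

# Complementary mixtures used on the olig#54 strand.
j, q, m = complement_mix(f), complement_mix(z), complement_mix(k)

overlap = "ATCCATGCCGGC"  # codons I34, H35, A36, G37
print(j, q, m)
print(len(overlap), round(gc_fraction(overlap), 2))
```

A G+C-rich overlap of this length helps the two strands anneal stably before extension.
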

The two strands shown above are synthesized, purified,
annealed, and extended to form dsDNA. Following
restriction endonuclease digestion and purification, the mutagenic
cassettes are ligated into pEP1438 (containing the
asymmetric HIV 353-369 target) in the appropriate locus in the
ravR gene. The ligation mixtures are used to transform
competent cells that contain pEP1306 (the plasmid with the
ravL gene carrying the FW58 site-specific mutation).


Above we picked a set of five residues in λ Cro, E53,
E54, V55, K56, and P57, as targets for Focused Mutagenesis
in the first variegation step of the procedure to produce a
RavR protein that complements the dimerization-deficient
VF55 RavL mutation. These five residues are contained on a
71 bp BamHI to KpnI fragment of the rav gene (Table 100).
To produce a cassette containing the variegated codons we
synthesize olig#58:

           g   s   v   y   a   X   X   X   X   X   f
          48  49  50  51  52  53  54  55  56  57  58
5' ct gat GGA TCC GTC TAC GCG fzk fzk fzk fzk fzk TTC
          |BamHI|

    p   s   n   k   k
   59  60  61  62  63
   CCG AGT AAC AAA AAA

    t   t   a  stop
   64  65  66  67
   ACA ACA GCG TAA TAGTAGGTACC ta  3'   olig#58


After synthesis and purification of the vgDNA, the
strands are self-annealed using the 10 nucleotide
palindrome at the 3' end of the sequence. The resulting
superoverhangs are filled in using the Klenow enzyme
reaction as described previously and the double-stranded
oligonucleotide is digested with BamHI and KpnI. Purified
mutagenic cassettes are ligated into one or more operative
vectors (picked from the pEP1200 series) in the appropriate
locus in the ravR gene. The ligation mixtures are used to
transform competent cells that contain pEP1305 (the plasmid
carrying the ravL- gene with the VF55 mutation).
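Self-annealing works because the 3' end of olig#58 is its own reverse complement. The Python sketch below checks this for the tail of olig#58 as printed above (GCG TAA TAGTAGGTACC ta); the function names are illustrative assumptions.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def longest_self_complementary_suffix(seq):
    """Length of the longest 3' suffix that equals its own reverse complement."""
    seq = seq.upper()
    best = 0
    # Only even lengths can be self-complementary.
    for length in range(2, len(seq) + 1, 2):
        tail = seq[-length:]
        if tail == reverse_complement(tail):
            best = length
    return best


OLIG58_TAIL = "GCGTAATAGTAGGTACCta"  # 3' end of olig#58
palindrome_length = longest_self_complementary_suffix(OLIG58_TAIL)
palindrome = OLIG58_TAIL.upper()[-palindrome_length:]
print(palindrome_length, palindrome)
```

The palindrome contains the KpnI recognition sequence GGTACC, so filling in the annealed strands regenerates a double-stranded KpnI site next to the stop codons.
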

Operative vectors carrying the VF55 or FW58 mutation
in ravL confer Km resistance. Operative vectors carrying
mutagenized ravR genes contain the gene for ApR as well as
the selective gene systems for the DBP+ phenotypes. Cells
containing complementing mutant proteins are selected by
requiring both ApR and KmR and repression of the complete
HIV 353-369 target sequence (substituted for the Left and
Right Symmetrized Targets in the selection genes). Cells
possessing the desired phenotype are ApR, KmR, FusR, and
GalR (in E. coli delta4).

Plasmids from candidate colonies are first isolated
genetically by transformation of cells at low plasmid
concentration. Cells carrying plasmids coding for RavL
proteins will be KmR, while cells carrying plasmids coding
for RavR proteins will be ApR. Plasmids are individually
screened to ensure that they confer the DBP- phenotype and
are characterized by restriction digest analysis to confirm
the presence of ravL or ravR genes. Plasmid pairs are
co-tested for complementation by restoration of the DBP+
phenotype when both ravL and ravR are present
intracellularly. Successfully complementing plasmids are sequenced
through the rav genes to identify the mutations and to
suggest potential locations for optional subsequent rounds
of variegation.






Plasmids containing genes for altered RavR proteins
that successfully complement the ravL VF55 mutation are
designated by plasmid numbers pEP1500 to pEP1599.
Similarly, plasmids containing genes for altered RavR proteins
that successfully complement the ravL FW58 mutation are
designated by plasmid numbers pEP1600 to pEP1699.



Heterodimeric proteins are purified and their DNA-
binding and thermal stability properties are characterized
as described above. Pairwise variation of the RavR and
RavL monomers can produce dimeric proteins having different
dimerization or dimer-DNA interaction energies. In
addition, further rounds of variegation of either or both
monomers to optimize DNA binding by the heterodimer,
dimerization strength or both may be performed.

In this manner a heterodimeric protein that recognizes
any predetermined target DNA sequence is constructed. The
foregoing is hypothetical. The sequences shown as the
result of selection are given by way of example and must
not be construed as predictions that proteins of the stated
sequence will have specific affinity for any DNA sequence.

Example 2

Presented below is a hypothetical example of a
protocol for developing new DNA-binding polypeptides,
derived from the first ten residues of phage P22 Arc and a
segment of variegated polypeptide with affinity for DNA
subsequences found in HIV-1 using E. coli K12 as the host
cell line. Some further optimization, in accordance with
the teachings herein, may be necessary to obtain the
desired results. Possible modifications in the preferred
method are discussed immediately following the hypothetical
example.


We set the same hypothetical technical capabilities as 
used in Detailed Example 1. 

Overview:

To obtain significant binding between a genetically
encoded polypeptide and a predetermined DNA subsequence,
the surfaces must be complementary over a large area, 1000
Å² to 3000 Å². For the binding to be sequence-specific,
the contacts must be spread over many (12 to 20) bases. An
extended polypeptide chain that touches 15 base pairs
comprises at least 25 amino acids. Some of these residues
will have their side groups directed away from the DNA so
that many different amino acids will be allowed at such
residues, while other residues will be involved in direct
DNA contacts and will be strongly constrained. Unless we
have 3D structural data on the binding of an initial
polypeptide to a test DNA subsequence, we can not a priori
predict which residues will have their side groups directed
toward the DNA and which will have their side groups
directed outward. We also can not predict which amino
acids should be used to specifically bind particular base
pairs. Current technology allows production of 10^ to 10^
independent transformants per ug of DNA which allows
variation of 5 or 6 residues through all twenty amino
acids. Alternatively, between 23 and 30 two-way variations
of DNA bases can be applied that will affect between 8 and
30 codons.
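The trade-off between varying a few residues fully and varying many positions two ways can be made concrete by comparing library sizes. The Python sketch below assumes, as in Example 1, that each fully varied residue uses a 32-codon mixture; that assumption and the function names are illustrative, not part of the protocol.

```python
def full_variation_dna(n_residues, codons_per_site=32):
    """Number of DNA sequences when n_residues are varied through all codons."""
    return codons_per_site ** n_residues


def full_variation_protein(n_residues):
    """Number of protein sequences when n_residues take all twenty amino acids."""
    return 20 ** n_residues


def binary_variation(n_sites):
    """Number of DNA sequences when n_sites each take one of two bases."""
    return 2 ** n_sites


for n in (5, 6):
    print(n, "residues:", full_variation_dna(n), "DNA,",
          full_variation_protein(n), "protein")
for n in (23, 30):
    print(n, "two-way sites:", binary_variation(n), "DNA")
```

Both strategies produce libraries of comparable size, so the choice between them turns on whether depth at a few positions or breadth across the chain is more useful at a given stage.
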






Sauer and colleagues (VERS87b) have shown that P22
Arc binds to DNA using a motif other than H-T-H. There is
as yet no published X-ray structure of Arc, though the
protein has been crystallized and diffraction data have
been collected (JORD85). A combination of genetics and
biochemistry indicates that the first 10 residues of each
Arc monomer (M-K-G-M-S-K-M-P-Q-F) bind to palindromically
related sets of bases on either side of the center of
symmetry of the 21 bp operator shown in Table 200.
Furthermore, the first ten residues of each Arc monomer
assume an extended conformation (VERS87b). The hydrophobic
residues may be involved in contacts to the rest of the
protein, but there are several examples from H-T-H DBPs of
hydrophobic side groups being in direct contact with bases
in the major groove. We do know that these first ten
residues of Arc can exist in a conformation that makes
sequence-specific favorable contacts with the arc operator.

We pick a target DNA subsequence from the HIV-1 genome
such that a portion of the chosen sequence is similar to
one half-site of the arc operator. We use part of this
chosen sequence for an initial chimeric target. One half
of the first target is the DNA subsequence obtained from
HIV-1 and the other half of the target is one half-site of
the arc operator. For this example, we will use a plasmid
bearing wild-type arc operators repressed by the Arc
repressor as a control. After demonstrating that Arc
repressor can regulate the selectable genes, we replace the
wild-type arc operator with the target DNA subsequence. We
then replace the arc gene with a variegated pdbp gene and
select for cells expressing DBPs that can repress the
selectable genes.

Once a protein is obtained that binds to the target
that has similarity to one half of the Arc operator, we can
change the target so that it has less similarity to one
half of the Arc operator and mutagenize those residues that
correspond to residues 1-10 of Arc. In vivo selection will
isolate a protein that binds to the new target. A few
repetitions of this process can produce a polypeptide that
binds to any predetermined DNA sequence.


Our potential DNA-binding polypeptide (DBP) will be 36
residues long and will contain the first ten residues of
Arc, which are thought to bind to part of the half operator.
DNA encoding the first ten amino acids of Arc is linked at
the 3' terminus of this gene fragment to vgDNA that encodes
a further 26 amino acids. Twenty-four of the codons encode
two alternative amino acids so that 2^24 = approx. 1.7 x 10^7
protein sequences result. The amino acids encoded are
chosen to enhance the probability that the resulting
polypeptide will adopt an extended structure and that it can
make appropriate contacts with DNA. The Chou-Fasman
(CHOU78a, CHOU78b) probabilities are used to pick amino
acids with high probability of forming beta structures (M,
V, I, C, F, Y, Q, W, R, T); the amino acids are grouped
into five classes in Table 16. In addition, to discourage
sequence-independent DNA binding, some acidic residues
should be included. Glutamic acid is a strong alpha helix
former, so in early stages we use D exclusively. Further,
S and T both can make hydrogen bonds with their hydroxyl
groups, but T favors extended structures while S favors
helices; hence we use only T in the initial phase.
Likewise, N and Q provide similar functionalities on their
side groups, but Q favors beta and so is used exclusively
in initial phases. Positive charge is provided by K and R,
but only R is used in the variegated portion. Alanine
favors helices and is excluded. P kinks the chain and is
allowed only near the carboxy terminus in initial
iterations.
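The size of a binary-variegated population, and what one member of it looks like, can be illustrated with a short Python sketch. The count of 24 two-way codons comes from the text; the amino-acid pairs below are hypothetical placeholders chosen from the beta-favoring set and are not the choices of Table 204.

```python
import random

# From the text: 24 codons each encode two alternative amino acids.
N_BINARY_POSITIONS = 24
LIBRARY_SIZE = 2 ** N_BINARY_POSITIONS

# Hypothetical two-way choices for illustration only (NOT Table 204).
ILLUSTRATIVE_PAIRS = [("I", "M"), ("V", "T"), ("R", "Q"), ("F", "Y")]


def sample_variant(pairs, rng):
    """Pick one of the two allowed amino acids at every variegated position."""
    return "".join(rng.choice(pair) for pair in pairs)


if __name__ == "__main__":
    print(f"{LIBRARY_SIZE:,} possible sequences")
    print(sample_variant(ILLUSTRATIVE_PAIRS, random.Random(0)))
```

Each sampled string is one member of the population; in the laboratory the sampling is done by the mixed-base oligonucleotide synthesis rather than by a random number generator.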
After one selection, we design a different set of
binary variegations that includes the selected sequence 
and perform a second mutagenesis and selection. After two 
or more rounds of diffuse variegation and selection, we 
choose a subset of residues and vary them through a larger 
set of amino acids. We continue until we obtain sufficient 
affinity and specificity for the target. None of the 
polypeptides discussed in this example is likely to have a 
defined 3D structure of its own, because they are all too 
short. Even if one folded into a definite structure, that 
structure is unlikely to be related to DNA-binding. A 3D 
structure, obtained by X-ray diffraction or NMR, of a DNA- 
polypeptide complex would give us useful indications of 
which residues to vary. Scattering the variegation along
the chain and sampling different charges, sizes, and 
hydrophobicities produces a series of proteins, isolated by 
in vivo selection, with progressively higher affinity for 
the target DNA sequence. 



Construction of the test plasmid:



Selection systems are the same as used in Example 1,
viz. fusaric acid to select against cells expressing the
tet gene and galactose killing by galT,K in a galE-deleted
host. First, in three genetic engineering steps, we
replace: a) the rav gene in pEP1009 with the arc gene, and
b) the target DNA sequences (both occurrences) with the arc
operator. The resulting plasmid is our wild type control.

To replace rav with arc, the synthetic arc gene,
shown in Table 201 and Table 202, is synthesized and
ligated into pEP1009 that has been digested with BstEII
and KpnI. Cells are transformed and colonies are screened
for Tc^R. The plasmid is named pEP2000. Delta4 cells
transformed with pEP2000 are Tc^R and Gal^S because pEP2000
lacks the rav gene.
To insert the arc operator into the neo promoter
(Pneo) for the tet gene in pEP2000, we digest pEP2000 with
StuI and HindIII and ligate the purified backbone to
annealed synthetic olig#430 and olig#432.

Arc operator and Pneo that promotes tet

           5' |CCT|GCG|AAC|CGG|AAT|TGC|CAG|-
Olig#430 = 3'  gga cgc ttg gcc tta acg gtc-
              |StuI|          | -35 |

              |CTG|GGG|CGC|CCT|CTG|GTA|AGG|TTG|-
               gac ccc gcg gga gac cat tcc aac-
                                  | -10 |

              |GGA|ATG|ATA|GAA|GCA|CTC|TAC|TAT|A 3' = Olig#432
               cct tac tat ctt cgt gag atg ata t tcg a 5'
                  |       Arc operator        | |Hind3|
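Before ordering a pair of oligonucleotides like these, it is worth checking that the two strands really are complementary and that the unpaired end matches the intended restriction site. The Python sketch below does this for the sequences printed above (spaces and bars removed); the function name and the reading of the diagram, with the lower strand written 3' to 5', are our assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Upper strand as printed, 5' -> 3'.
TOP_5_TO_3 = (
    "CCTGCGAACCGGAATTGCCAG"
    "CTGGGGCGCCCTCTGGTAAGGTTG"
    "GGAATGATAGAAGCACTCTACTATA"
)
# Lower strand as printed, 3' -> 5', including the unpaired tail.
BOTTOM_3_TO_5 = (
    "GGACGCTTGGCCTTAACGGTC"
    "GACCCCGCGGGAGACCATTCCAAC"
    "CCTTACTATCTTCGTGAGATGATAT"
    "TCGA"
)


def check_duplex(top_5_to_3, bottom_3_to_5):
    """Return (fully_paired, overhang read 5'->3') for an annealed pair.

    Assumes both strands start flush at the left (blunt StuI end) and
    that the lower strand is the longer one.
    """
    paired_region = bottom_3_to_5[: len(top_5_to_3)]
    fully_paired = top_5_to_3.translate(COMPLEMENT) == paired_region
    overhang = bottom_3_to_5[len(top_5_to_3):][::-1]
    return fully_paired, overhang


PAIRED, OVERHANG = check_duplex(TOP_5_TO_3, BOTTOM_3_TO_5)
print(PAIRED, OVERHANG)
```

The four-base 5' overhang reported by the check is the one left by HindIII digestion, which is what allows directional ligation into the StuI/HindIII-cut backbone.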



The plasmid is named pEP2001 and confers Fus^R, Gal^S, Ap^R on
delta4 cells.

To insert the arc operator into the amp promoter for
the galT,K genes in pEP2001, we digest pEP2001 with ApaI
and XbaI and ligate the purified backbone to synthetic
olig#416 and olig#417 that have been annealed in the
standard way.
Arc operator and Pamp that promotes galT,K

           5' |CTT|CTA|AAT|ACA|TTC|AAA|-
Olig#417 3' c cgg gaa gat tta tgt aag ttt-
            | ApaI |          | -35 |

              |TAT|GTA|TCC|GCT|CAT|GAG|ACA|ATA|ACC|-
               ata cat agg cga gta ctc tgt tat tgg-

              |CTT|ATG|ATA|GAA|GCA|CTC|TAC|TAT|CGT 3' Olig#416
               gaa tac tat ctt cgt gag atg ata gca gat c 5'
                  |       Arc operator        |  | XbaI |

The plasmid is named pEP2002 and confers Gal^R, Fus^R, Ap^R on
delta4 cells. This plasmid is our wild type for work with
polypeptides that are selected for binding to target DNA
subsequences that are related to the arc operator.

Development of polypeptides that bind chimeric target DNA:
We now replace:

a) the two occurrences of the arc operator with the
first target sequence that is a hybrid of the arc
operator and a subsequence picked from HIV-1, and

b) the arc gene by a variegated pdbp gene.

A hybrid non-palindromic target sequence is used in
this example because selection of a polypeptide using a
palindromic or nearly palindromic target DNA subsequence is
likely to isolate a novel dimeric DBP. The goal of this
procedure is to isolate a polypeptide that binds DNA but
that does not directly exploit the dyad symmetry of DNA.
The binding is most likely in the major groove, but the
present invention is not limited to polypeptides that bind
in the major groove. The selections are performed using a
non-symmetric target to avoid isolation of novel dimers
that support two symmetrically related copies of the
original recognition elements.

The non-variable regions of the HIV-1 genome, as
listed in Example 1, were searched using a half operator
from the arc operator as search sequence.
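A search of this kind can be sketched in Python. The sketch makes several assumptions that should be stated plainly: the half-operator consensus is derived here from the 21-base arc operator that appears in the oligonucleotide diagrams of this example (the left half and the reverse complement of the right half), not copied from Table 200; only one strand is scanned; and the "genome" is a toy string that merely embeds HIV 1024-1044 as printed later in this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
ARC_OPERATOR = "ATGATAGAAGCACTCTACTAT"


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def half_operator_consensus(operator):
    """One set of allowed bases per position of the half operator."""
    half = len(operator) // 2
    left = operator[:half]
    right = reverse_complement(operator[-half:])
    return [{a, b} for a, b in zip(left, right)]


def find_near_matches(sequence, consensus, max_mismatches=1):
    """Return (start, mismatches) for windows within max_mismatches."""
    hits = []
    width = len(consensus)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(
            base not in allowed for base, allowed in zip(window, consensus)
        )
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


CONSENSUS = half_operator_consensus(ARC_OPERATOR)
TOY_SEQUENCE = "CC" + "ATAATACAGTAGCAACCCTCT" + "GG"  # toy stand-in, not the genome
HITS = find_near_matches(TOY_SEQUENCE, CONSENSUS)
print(HITS)
```

A production search would also scan the complementary strand and restrict the scan to the non-variable regions.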

We sought subsequences in the non-variable sequences
of the HIV-1 genome that match either half of the consensus
P22 arc operator shown in Table 200. Subsequences that are
closer to the start of transcription are preferred as
targets because proteins binding to these subsequences will
have greater effect on the transcription of the genes. No
sequence was found that matched all six unambiguous bases;
the subsequences at 1024, 1040, and 2387 (shown in Table
203) each have a single mismatch. Lower case letters in
the arcO sequence indicate ambiguity in the P22 arc
operator sequence. Lower case, bold, underscored letters
in the HIV-1 subsequences indicate mismatch with the
consensus arc operator. Two other subsequences, shown in
Table 203, have one mismatch at one of the conserved bases
and one mismatch with one of the ambiguous bases. The
HIV-1 subsequence that starts at base 1024 is chosen as a
target sequence. We replace the 3' ten bases of the arc
operator with the 3' ten bases of this subsequence to
produce the hybrid target sequence:

ATGATAGAAG | C | GCAACCCTCT
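The construction of the hybrid target can be reproduced in a few lines of Python. The two 21-base sequences are taken from the oligonucleotide diagrams of this example; the helper name make_hybrid is ours.

```python
ARC_OPERATOR = "ATGATAGAAGCACTCTACTAT"
HIV_1024_1044 = "ATAATACAGTAGCAACCCTCT"


def make_hybrid(left_source, right_source, n_right=10):
    """Keep left_source except for its last n_right bases, which are
    replaced by the last n_right bases of right_source."""
    if len(left_source) != len(right_source):
        raise ValueError("sequences must have equal length")
    return left_source[:-n_right] + right_source[-n_right:]


HYBRID_TARGET = make_hybrid(ARC_OPERATOR, HIV_1024_1044)
print(HYBRID_TARGET[:10], HYBRID_TARGET[10], HYBRID_TARGET[11:])
```

The printed pieces are the left half operator, the central base retained from the arc operator, and the ten bases contributed by HIV-1.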

We insert this sequence into the promoter that regulates
tet in pEP2002 by ligating dsDNA composed of an equimolar
mixture of olig#440 and olig#442 into the StuI/HindIII site
of pEP2002. Substitution of the arc operator by the
arc-HIV-1 hybrid sequence relieves the repression by Arc. The
construction is called pEP2003 and confers Tc^R, Ap^R, Gal^R
on delta4 cells.
First Target and Pneo that promotes tet

           5' |CCT|GCG|AAC|CGG|AAT|TGC|CAG|-
Olig#440 = 3'  gga cgc ttg gcc tta acg gtc-
              |StuI|          | -35 |

              |CTG|GGG|CGC|CCT|CTG|GTA|AGG|TTG|-
               gac ccc gcg gga gac cat tcc aac-
                                  | -10 |

                   ATA ATA CAG TAg caa ccc tct    = HIV 1024-1044
              |GGA|ATG|ATA|gAA|GCg|caa|ccc|tct|A 3' = Olig#442
               cct tac tat ctt cgC GTT GGG AGA t tcg a 5'
                  |       First Target        | |Hind3|



The second instance of the target is engineered in
like manner, using pEP2003 first digested with ApaI and
XbaI and then ligated to annealed olig#444 and olig#446.
The plasmid is called pEP2004 and confers Gal+, Tc^R, Ap^R on
HB101 cells. The plasmid pEP2004 contains the first target
sequence in both selectable genes and is ready for
introduction of a variegated pdbp gene.



First Target and Pamp that promotes galT,K

           5' |CTT|CTA|AAT|ACA|TTC|AAA|
Olig#444 3' c cgg gaa gat tta tgt aag ttt
            | ApaI |          | -35 |

              |TAT|GTA|TCC|GCT|CAT|GAG|ACA|ATA|ACC|CT-
               ata cat agg cga gta ctc tgt tat tgg ga
                                  | -10 |

              T|ATG|ATA|GAA|GCg|caa|ccc|tct|CGT 3' Olig#446
              a tac tat ctt cgc GTT GGG AGA gca gat c 5'
               |       First Target       |  | XbaI |
The variegated DNA for a 36 amino acid polypeptide is
shown in Table 204. This DNA encodes the first ten amino
acids of P22 Arc followed by 26 amino acids chosen to be
likely to form extended structures. In Table 204, we
indicate variegation at one base by using a letter, other
than A, C, G, or T, to represent a specific mixture of
deoxynucleotide substrates. The range of amino acids
encoded is written above the codon numbers:

      I|M
     | 11|
     |ATs|

indicates that the first base is synthesized with A, the
second base with T, and the third base with a mixture of C
and G, and that the resulting DNA could encode amino acids
I or M. That the parental protein has isoleucine at
residue 11 is indicated by writing I first. Residues 22
and 23 are not variegated to provide a homologous overlap
region so that olig#420 and olig#421 can be annealed.
After olig#420 and olig#421 are annealed and extended with
Klenow fragment and all four deoxynucleotide triphosphates,
the DNA is digested with both BstEII and Bsu36I and ligated
into pEP2004 that has also been digested with BstEII and
Bsu36I. The ligated DNA, denoted vg1-pEP2004, is used to
transform Delta4 cells. After an appropriate grow out in
the presence of IPTG, the cells are selected with fusaric
acid and galactose.
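The mixed-base notation can be checked mechanically. The Python sketch below expands a variegated codon into every codon it represents and translates each with the standard genetic code. Only the letter "s" (a mixture of C and G) is taken from the text; the function name and the table-building idiom are ours.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Mixtures used here; "s" = C + G as described in the text.
MIXTURES = {"A": "A", "C": "C", "G": "G", "T": "T", "s": "CG"}


def encoded_amino_acids(variegated_codon):
    """Sorted one-letter codes of all amino acids a mixed codon can encode."""
    choices = [MIXTURES[symbol] for symbol in variegated_codon]
    return sorted({CODON_TABLE["".join(c)] for c in product(*choices)})


print(encoded_amino_acids("ATs"))
```

The same function can be used to confirm that a proposed mixture never yields a stop codon (shown as "*" in the table) before an oligonucleotide is synthesized.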

By hypothesis, we recover ten colonies that are Gal^R
and Fus^R. We sequence the plasmid DNA from each of these
colonies. A hypothetical DBP amino acid sequence from one
of these colonies is shown in Table 205.

Comparison of the amino-acid sequences of different
isolates may provide useful information on which residues
play crucial roles in DNA binding. Should a residue
contain the same amino acid in most or all isolates, we
might infer that the selected amino acid is preferred for
binding to the target sequence. Because we do not know 
that all of the isolates bind in the same manner, this 
inference must be considered as tentative. Residues closer 
to the unvaried section that have repetitive isolates 
containing the same amino acid are more informative than 
residues farther away. 






In a second round of Diffuse Mutagenesis, we vary the
codons shown in Table 206. Residues 1 through 10 are not
varied because these provide the best match for the first
ten bases of the target. Residues 19, 20, and 21 are not
varied so that the synthetic oligonucleotides can be
annealed. The two-way variations at residues 11 through 18
and 23 through 36 all allow the selected amino acid to be
present, but also allow an as-yet-untested amino acid to
appear. It is desirable to introduce as much variegation
as the genetic engineering and selection methods can
tolerate without risk that the parental DBP sequence will
fall below detectable level. Having picked three residues
for the homologous overlap, we have only 23 amino acids to
vary. Thus residue 22 is varied through four
possibilities instead of only two. Residue 22 was chosen for
four-way variegation because it is next to the unvaried
residues. We use pEP2004 as the backbone, and ligate DNA
prepared with Klenow fragment from oligonucleotides #423
and #424 (Table 206) to the BstEII and Bsu36I sites. The
resulting population of plasmids containing the variegated
DNA is denoted vg2-pEP2004.
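The bookkeeping behind this choice can be written out as a short Python sketch. The position counts (22 two-way residues plus one four-way residue) come from the paragraph above. The coverage estimate is our addition: it assumes every variant is equally likely and uses the usual Poisson approximation, and the number of transformants is a hypothetical input, not a figure from this example.

```python
from math import exp, prod


def library_size(choices_per_position):
    """Number of distinct sequences for a given variegation scheme."""
    return prod(choices_per_position)


def expected_coverage(n_transformants, size):
    """Approximate fraction of variants present at least once,
    assuming equiprobable variants (Poisson approximation)."""
    return 1.0 - exp(-n_transformants / size)


# Residues 11-18 and 23-36 vary two ways; residue 22 varies four ways.
ROUND_TWO_SCHEME = [2] * 22 + [4]
ROUND_TWO_SIZE = library_size(ROUND_TWO_SCHEME)

print(f"{ROUND_TWO_SIZE:,} sequences")
print(f"{expected_coverage(ROUND_TWO_SIZE, ROUND_TWO_SIZE):.3f}")
```

Making residue 22 four-way restores the population to the same size as 24 two-way positions, so the demands on transformation efficiency are unchanged from the first round.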






Table 207 shows the amino acid sequence obtained from
a hypothetical isolate bearing a DBP gene specifying a
polypeptide with improved affinity for the target. Changes
in amino acid sequence are observed at ten positions.
Comparisons of the sequences from several such isolates as
well as those obtained in the first round of mutagenesis
can be used to locate residues providing significant
DNA-binding energy.

Having established some affinity for the target, we
now seek to optimize binding via a more focused mutagenesis
procedure. Table 208 shows a third variegation in which
twelve residues in the variable region are varied through
four amino acids in such a way that the previously selected
amino acids may occur. Again, pEP2004 is used as backbone
and synthetic DNA having cohesive ends is prepared from
olig#325 and olig#327. The plasmid is denoted vg3-pEP2004.
In subsequent variegation, we would vary other residues
through four amino acids at one time. By hypothesis, we
select the polypeptide shown in Table 209 that has high
specific affinity for the first target; now we can:

a) replace both occurrences of the first target by a
second target, i.e. the intact HIV-1 subsequence
(1024-1044), and

b) use the selected polypeptide as the parental DBP to
generate a variegated population of polypeptides from
which we select one or more that bind to the second
target.

Because the second target differs from the first in the
region thought to be bound by residues 1 through 10 of the
parental DBP, we concentrate our variegation within these
residues for the first several rounds of variegation and
selection.

We replace the target DNA sequence in the neo promoter
for tet in pEP2002 with dsDNA comprising synthetic
olig#450 and olig#452 at the StuI/HindIII site. The new
plasmid is named pEP2010 and confers Tc^R on delta4 cells.
Second Target and Pneo that promotes tet

           5' |CCT|GCG|AAC|CGG|AAT|TGC|CAG|-
Olig#450 = 3'  gga cgc ttg gcc tta acg gtc-
              |StuI|          | -35 |

              |CTG|GGG|CGC|CCT|CTG|GTA|AGG|TTG|GG-
               gac ccc gcg gga gac cat tcc aac cc-
                                  | -10 |

                ATA ATA CAG TAg caa ccc tct    = HIV 1024-1044
              A|ATa|ATA|cAg|tag|caa|ccc|tct|A 3' Olig#452
              t taT tat GtC ATC GTT GGG AGA t tcg a 5'
               |       Second Target       | |Hind3|

We replace the target in the amp promoter for galT,K
of pEP2010 with synthetic olig#454 and olig#456 between
ApaI and XbaI sites. The new plasmid is named pEP2011 and
confers Gal+ on HB101. pEP2011 contains the second target
in both selectable genes and is ready for introduction of a
variegated pdbp gene and selection of cells expressing
polypeptides that can selectively bind the target DNA
subsequence.

Second Target and Pamp that promotes galT,K

           5' |CTT|CTA|AAT|ACA|TTC|AAA|
Olig#454 3' c cgg gaa gat tta tgt aag ttt
            | ApaI |          | -35 |

              |TAT|GTA|TCC|GCT|CAT|GAG|ACA|ATA|ACC|CT
               ata cat agg cga gta ctc tgt tat tgg ga
                                  | -10 |

                ATA ATA CAG TAg caa ccc tct    = HIV 1024-1044
              T|ATa|ATA|cAg|tag|caa|ccc|tct|CGT 3' Olig#456
              a taT tat GtC ATc GTT GGG AGA gca gat c 5'
               |       Second Target       |  | XbaI |

Variegation of the first eleven residues of the
potential DNA-binding polypeptide is illustrated in Table
210. Double-stranded DNA having appropriate cohesive ends
is prepared from olig#460 and olig#461, Klenow fragment,
BstEII, and Bsu36I. This DNA is ligated into similarly
digested backbone DNA from pEP2011; the resulting plasmid
is denoted vg1-pEP2011. Delta4 cells are transformed and
selected with fusaric acid and galactose. Table 211 shows
the sequence of a 37 amino-acid polypeptide isolated from
cells exhibiting the DBP+ phenotypes by the above
hypothetical selection. The sequence shown in Table 211 is
hypothetical and is given by way of example. This example
must not be construed as a prediction that this sequence
has specific affinity for the target or any other DNA
sequence. Further variegation (vg2, vg3, ...) of this
peptide and selection for binding to Target#2 will be
needed to obtain a peptide of high specificity and affinity
for Target#2.

We anticipate that successful DBP production will take
more than three or four cycles of variegation and
selection, perhaps 10 or 15. We anticipate that initial phases
will require careful adjustment of the selective agents and
IPTG because the level of repression afforded by the best
polypeptide may be quite low. As stated, we expect that
biophysical methods, such as X-ray diffraction or NMR,
applied to complexes of DNA and polypeptide will yield
important indications of how to hasten the forced
evolution.



The length of the polypeptide in the example may not
be optimal; longer or shorter polypeptides may be needed.
It may be necessary to bias the amino acid composition more
toward basic amino acids in initial phases to obtain some
non-specific DNA binding. Inclusion of numerous aromatic
amino acids (W,F,Y,H) may be helpful or necessary.

Other strategies to obtain polypeptides that bind
sequence-specifically are illustrated in Examples 3, 4,
and 5.






Example 3 

We present a second example of the application of our
selection method applied to the generation of asymmetric
DBPs. A possible problem with making and using DNA-binding
polypeptides is that the polypeptides may be degraded in
the cell before they can bind to DNA. That polypeptides
can bind to DNA is evident from the information on
sequence-specific binding of oligopeptides such as Hoechst
33258. Polypeptides composed of the 20 common natural
amino acids contain all the needed groups to bind DNA
sequence-specifically. These are obtained by an efficient
method to sort out the sequences that bind to the chosen
target from the ones that do not. To overcome the tendency
of the cells to degrade polypeptides, we will attach a
domain of protein to the variegated polypeptide as a
custodian. The first example of a custodial domain
presented is residues 20-83 of barley chymotrypsin
inhibitor.

The strategy is to fuse a polypeptide sequence to a
stable protein, assuming that the polypeptide will fold up
on the stable domain and be relatively more protected from
proteases than the free polypeptide would be. If the
domain is stable enough, then the polypeptide tail will
form a make-shift structure on the surface of the stable
domain, but when the DNA is present, the polypeptide tail
will quickly (a few milliseconds) abandon its former
protector and bind the DNA. The barley chymotrypsin
inhibitor (BCI-2) is chosen because it is a very stable
domain that does not depend on disulfide bonds for
stability. We could attach the variegated tail at either end of
BCI-2. A preferred order of amino acid residues in the
chimeric polypeptide is: a) methionine to initiate
translation, b) BCI-2 residues 20-83, c) a two residue linker, d)
the first ten residues of Arc, and e) twenty-four residues
that are varied over two amino acids at each residue. The
linker consists of G-K. Glycine is chosen to impart
flexibility. Lysine is included to provide the potentially
important free amino group formerly available at the amino
terminus of the Arc protein. The first target is the same
as the first target of Example 2.
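The preferred order of segments fixes the residue numbering of the chimeric polypeptide, which is convenient to tabulate before designing the gene. The Python sketch below does the arithmetic; the segment lengths are taken from the list above, while the labels and the function name are ours.

```python
# (label, length in residues), in the preferred amino-to-carboxy order.
SEGMENTS = [
    ("initiator Met", 1),
    ("BCI-2 residues 20-83", 83 - 20 + 1),
    ("G-K linker", 2),
    ("Arc residues 1-10", 10),
    ("variegated tail", 24),
]


def segment_layout(segments):
    """Return (label, first_residue, last_residue) using 1-based numbering."""
    layout = []
    position = 1
    for label, length in segments:
        layout.append((label, position, position + length - 1))
        position += length
    return layout


LAYOUT = segment_layout(SEGMENTS)
for label, first, last in LAYOUT:
    print(f"{first:3d}-{last:3d}  {label}")
```

On this numbering the variegated residues occupy the carboxy-terminal end of the chimeric chain, following the ten Arc residues.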

Table 300 shows the sequence of a gene encoding the
required sequence. The ambiguity of the genetic code has
been resolved to create restriction sites for enzymes that
do not cut pEP1009 outside the rav gene. This gene could
be synthesized in several ways, including the method
illustrated in Table 301 involving ligation of
oligonucleotides 470-479. Plasmid pEP3000 is derived from pEP2004 by
replacement of the arc gene with the sequence shown in
Table 300 by any appropriate method.

Table 302 illustrates variegated olig#480 and olig#481
that are annealed and introduced into the CI2-arc(1-10)
gene between PpuMI and KpnI to produce the plasmid
population vg1-pEP3000. Cells transformed with vg1-pEP3000 are
selected with fusaric acid and galactose in the presence of
IPTG. Further variegation (vg2, vg3, ...) will be required
to obtain a polypeptide sequence having acceptably high
specificity and affinity for Target#1.

Example 4

We present a second strategy involving a polypeptide
chain attached to a custodial domain. In this strategy,
the custodial domain contains a DNA-recognizing element
that will be exploited to obtain quicker convergence of the
forced evolution.

The three alpha helices of Cro fold on each other.
It has not been observed that these helices fold by
themselves, but no efforts in this direction have been
reported. We will attach a variegated segment of 24 residues to
residue 35 of Cro (H35 is the last residue of alpha 3).
The target will be picked to contain a good approximation
to the half Or3 site at one end but no constraint is placed
on the bases corresponding to the dyad-related other half
of Or3. A sequence that departs widely from the Or3
sequence is actually preferred, because this discourages
selection of a novel dimeric molecule. We assume that
alpha-3 forms and binds to the same four or five bases that
it binds in Or3 and that a polypeptide segment attached to
the carboxy terminus of alpha-3 can continue along the
major groove. Attaching 24 amino acids of polypeptide
immediately after the last residue of alpha-3, wherein the
polypeptide is chosen: a) to have more positive charge than
negative charge, b) to have beta chain predominate, c) to
have some aromatic groups, and d) to have some H-bonding
groups, produces a population that is then cloned; host
cells are selected for expression of a polypeptide that
binds preferentially to the target sequence.






We first construct a hybrid target sequence (Target
#3) containing one Or3 half-site fused to a portion of the
final target. This hybrid target DNA subsequence is
inserted into the selectable genes in the same manner as
the arc operator was inserted in Example 2. We then follow
the same procedure to vary the 24 residues; first we vary
twenty-four residues, using two possible amino acids at
each residue. We carry out two or more cycles of such
diffuse variegation. Then we vary 12 residues, using 4
possible amino acids at each residue. We do two or more
iterations of this process so that all residues are varied
at least once.

We have now generated one or more DBPs that bind well
to one half of the final target sequence. Next we generate
binding to the other half of the final target. First we
replace both instances of Target #3 with the final target
sequence, Target #4. We then vary the alpha helix 3 and
the surface of the hypothesized domain formed by helices
1-3 to optimize binding to the final target sequence.

A search of the non-variable regions of the HIV-1
genome reveals that bases 624-640 (aATCtCTAGCAGTGGCG)
contain a good match to one half of Or3, as shown in Table
400. As first target of this example, we choose

TATCCCTAGCAGTGGCG,

denoted Target#3, that has one half of Or3 and nine bases
from HIV-1. Once a sequence is obtained that binds
Target#3, we replace Target#3 by Target#4 = HIV 624-640 and
variegate the recognition helices taken from Cro.
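The difference between the two targets can be listed position by position with a short Python sketch. Both 17-base sequences are taken from the text; the function name and the HIV-1 numbering offset applied to the output are ours.

```python
TARGET_3 = "TATCCCTAGCAGTGGCG"
HIV_624_640 = "AATCTCTAGCAGTGGCG"  # Target#4, written in upper case
FIRST_BASE = 624


def list_mismatches(seq_a, seq_b, first_base=1):
    """Return (position, base_in_a, base_in_b) for every differing position."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return [
        (first_base + i, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


MISMATCHES = list_mismatches(TARGET_3, HIV_624_640, FIRST_BASE)
print(MISMATCHES)
```

The two positions reported are the ones written in lower case in the HIV 624-640 sequence above, so the step from Target#3 to Target#4 changes only two bases, both in the half-site recognized by the Cro-derived helices.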

To engineer Target#3 into Pneo that regulates tet,
plasmid pEP2002 is digested with StuI and HindIII and the
purified backbone is ligated to an annealed, equimolar
mixture of olig#490 and olig#492. Delta4 cells are
transformed and selected with Tc; replacement of the arc
operator relieves the repression by Arc. Plasmid DNA from
Tc^R colonies is sequenced to confirm the construction; the
construction is called pEP4000.



Target #3 and Pneo that promotes tet

           5' |CCT|GCG|AAC|CGG|AAT|TGC|CAG|-
Olig#490 = 3'  gga cgc ttg gcc tta acg gtc-

              |CTG|GGG|CGC|CCT|CTG|GTA|AGG|TTG|GG-
               gac ccc gcg gga gac cat tcc aac cc-
                                  | -10 |

                aAT ctc TAG CAG TGG CG      = HIV 624-640
              A|TAT|CCC|TAG|CAG|TGG|CGA 3' Olig#492
              t ata ggg atc gtc acc gct tcg a 5'
               |     Target #3     |   |Hind3|

We engineer the second instance of the target, in
like manner, into Pamp for galT,K, using ApaI and XbaI to
digest pEP4000 and olig#494 and olig#496. HB101 cells
(galK-) are transformed and are selected for ability to
grow on galactose as sole carbon source. Plasmid DNA from
Gal+ colonies is sequenced in the region of the insert to
confirm the construction. The plasmid is called pEP4001.



Target #3 and Pamp that promotes galT,K

           5' |CTT|CTA|AAT|ACA|TTC|AAA|
Olig#494 3' c cgg gaa gat tta tgt aag ttt
            | ApaI |          | -35 |

              |TAT|GTA|TCC|GCT|CAT|GAG|ACA|ATA|ACC|
               ata cat agg cga gta ctc tgt tat tgg
                                  | -10 |

              |CTT|TAT|CCC|TAG|CAG|TGG|CG CGT 3' Olig#496
               gaa ata ggg atc gtc acc gc gca gat c 5'
                  |     Target #3     |   | XbaI |

A gene fragment encoding the first two helices of Cro
is shown in Table 401. Olig#483 and olig#484 are
synthesized and extended in the standard manner and the DNA is
digested with BstEII and KpnI. This DNA is ligated to
backbone from pEP4001 that has been digested with BstEII
and KpnI; the resulting plasmid, denoted pEP4002, contains
the Target#3 subsequence in both selectable genes and is
ready for introduction of a variegated pdbp gene between
BglII and KpnI. Table 402 shows a piece of vgDNA prepared
to be inserted into the BglII-KpnI sites of pEP4002. Table
403 shows the result of a selection of delta4 cells
transformed with vg1-pEP4002, with fusaric acid and
galactose in the presence of IPTG. Additional cycles of
variegation of residues 36-61 are carried out in such a way
that the amino acid selected at the previous cycle is
included. After several cycles in which 22-24 residues are
varied through two possible amino acids, we choose 10-13
amino acids and vary them through four possibilities.

Once reasonably strong binding to Target #3 is
obtained, we replace Target#3 with Target#4 and vary the
residues in helix 3 (residues 26-35) and, to a lesser
extent, helix 2 (residues 16-23).

Example 5 

We disclose here a method of engineering a polypeptide
extension onto the amino terminus of P22 Arc, a natural
DBP, so that the novel DBP develops asymmetric DNA-binding
specificity for a subsequence found in the HIV-1 genome.
Others have observed that loss of arms from natural DBPs
may cause loss of binding specificity and affinity (PABO82a
and ELIA85), but none, to our knowledge, have suggested
adding arms to natural DBPs in order to enhance or alter
specificity or affinity. The new construction is denoted a
"polypeptide extension DBP"; the gene is denoted ped and
the proteins are denoted Ped. Wild-type Arc forms dimers
and binds to a partially palindromic operator. We will
generate a sequence of DBPs descended from Arc. Early
members of this family will form dimers, but will have
sufficient binding area such that asymmetric targets will
be bound. In final stages of the development, proteins
that do not dimerize will be engineered.



Table 200 shows the symmetric consensus of left and
right halves of the P22 arc operator, arcO. Table 500a
shows a schematic representation of the model for binding
of Arc to arcO that is supported by genetic and biochemical
data (VERS87b). Arc is thought to bind B-DNA in such a way
that residues 1-10 are extended and the amino terminus of
each monomer contacts the outer bases of the 21 bp operator
(RT Sauer, public talk at MIT, 15 September 1987).

Arc is preferred because: a) one end of the
polypeptide chain is thought to contact the DNA at the exterior
edge of the operator, and b) Arc is quite small so that
genetic engineering is facilitated. P22 Mnt is also a good
candidate for this strategy because it is thought that the
amino terminal six residues contact the mnt operator, mntO,
in substantially the same manner as Arc contacts arcO. Mnt
has significant (40%) sequence similarity to Arc (VERS87a).
Mnt forms tetramers in solution and it is thought that the
tetramers bind DNA while other forms do not. When the mnt
gene is progressively deleted from the 3' end to encode
truncated proteins, it is observed that proteins lacking
K79 and subsequent residues have lowered affinity for mntO
and that proteins lacking Y78 and subsequent residues
cannot form tetramers and do not bind DNA
sequence-specifically (KNIG88). Some truncated Mnt proteins of 77 or fewer
residues form dimers, but these dimers do not present the
DNA-recognizing elements in such a way that DNA can be
bound. Arc is preferred over Mnt because Arc is smaller
and because Arc acts as a dimer.



Other natural DBPs that have DNA-recognizing segments
thought to interact with DNA in an extended conformation
(referred to as arms or tails) and thought to contact the
central part of the operator, such as lambda Cro or lambda cI
repressor, are less useful. For these proteins to be
lengthened enough to contact DNA outside the original
operator, several residues would be needed to span the
space between the central bases contacted by the existing
terminal residues and the exterior edge of the operator.

Table 500a illustrates interaction of Arc dimers with
arcO; the two "C"s of Arc represent the place, near residue
F10, at which the polypeptide chain ceases to make direct
contact with the DNA and folds back on itself to form a
globular domain, as shown in Table 500b and Table 500c.
Which of these alternative possibilities actually occurs
has not been reported. Our strategy is compatible, with
some alterations, with either structure. In Table 500b,
each set of residues 1-10 makes contact with a domain
composed of residues 11-57 of the same polypeptide chain;
the dimer contacts are near the carboxy terminus. Table
500c shows an alternative interaction in which residues
1-10 of one polypeptide chain interact with residues 11-57
of the other polypeptide chain; the dimer contacts occur
shortly after residue 10. The similarity of sequences of
Arc and Mnt, the demonstration of function of
DNA-recognizing segments transferred from Arc to Mnt (RT Sauer, public
talk at MIT, 15 September 1987 and Knight and Sauer cited
in VERS86b), and the behavior of Mnt on truncation suggest
that Table 500b is the correct general structure for Arc,
but the structure diagrammed in Table 500c is also
possible.

Table 501 shows the four sites at which one of the
consensus arc half operators comes within one base of
matching ten bases (six unambiguous and four having
two-fold ambiguity) in the non-variable segments of HIV-1 DNA
sequence, as listed in Example 1. The symbol "@" marks
base pairs that vary among different strains of HIV-1.
Because we intend to extend Arc from its amino terminus, we
seek subsequences of HIV-1 that: a) match one of the arc
half operators, and b) have non-variable sequences located
so that an amino-terminal extension of the Arc protein will
interact with non-variable DNA. The subsequences 1024-1033
and 4676-4685 meet this requirement while the subsequences
at 1040-1049 and 2387-2396 do not. In the case of
1040-1049, the amino-terminal extension would proceed in the 3'
direction of the strand shown and would reach variable DNA
after two base pairs. For 2387-2396, variable sequence is
reached at once. The subsequence 1024-1033 is preferred
over the subsequence 4676-4685 because it is much closer to
the beginning of transcription of HIV so that binding of a
protein at this site will have a much greater effect on
transcription. In the remainder of this example, positions
within the target DNA sequence will be given the number of
the corresponding base in HIV-1. Base A1034 of HIV-1 is
aligned with the central base of arcO.






HIV 1024-1044 has only three bases in each half that
are palindromically related to bases in the other half by
rotation about base pair 1034: A1024/T1044, A1026/T1042,
and G1032/C1036. The latter two base pairs correspond to
positions in arcO that are not palindromically related.
Five of the six palindromically related bases of arcO
correspond to non-palindromically related bases in HIV
1024-1044. Thus no dimeric protein derived from Arc is
likely to bind HIV 1016-1046 if symmetric changes are made
only in residues 1-10 (or in any other set of residues
originally found in Arc). Our strategy is to add, in
stages, eleven variegated residues at the amino terminus
and to select for specific binding to a progression of
targets, the final target of the progression being bases
1016-1037 of HIV-1. Because the region of protein-DNA
interaction is increased beyond that inferred for wild-type
Arc-arcO complexes, unfavorable contacts in bases aligned
with the right half of arcO can be compensated by favorable
contacts of the polypeptide extension with bases 1016-1023.
The penultimate selection isolates a dimeric protein that
binds to the HIV-1 target 1016-1037; the ultimate selection
isolates a protein that does not dimerize and binds to the
same target.
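
The idea of bases being palindromically related by rotation
about a central base pair can be made concrete with a short
sketch. The function below reports position pairs in which the
base on the left is the Watson-Crick complement of the base at
the mirror position on the right. The 21-base sequence is a
made-up placeholder, constructed only so that three pairs are
found; it is not arcO or HIV-1 sequence, and the function name
and numbering offset are assumptions for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def palindromic_pairs(seq, first_position=1):
    """Return (left, right) position pairs whose bases are related by
    two-fold rotation about the central base of an odd-length sequence."""
    seq = seq.upper()
    if len(seq) % 2 == 0:
        raise ValueError("expected an odd-length sequence with a central base")
    center = len(seq) // 2
    pairs = []
    for offset in range(1, center + 1):
        left, right = center - offset, center + offset
        # Related by rotation: the right base pairs with the left base.
        if COMPLEMENT[seq[left]] == seq[right]:
            pairs.append((first_position + left, first_position + right))
    return sorted(pairs)

# Hypothetical 21-base sequence (NOT real arcO or HIV-1 sequence),
# numbered from 1024 so that its central base is position 1034.
example = "ACATGACTGAACCGAAGTTCT"
related = palindromic_pairs(example, first_position=1024)
print(related)
```

A fully palindromic operator would return one pair for every
offset from the center; a target with few pairs offers few
symmetric contacts to a dimeric protein.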
Table 502 shows a progression of target sequences
that leads from wild-type arcO to HIV 1016-1037. It is
emphasized that finding a subsequence of HIV-1 that has
high similarity to one half of arcO is not necessary;
rather, use of this similarity reduces the number of steps
needed to change a sequence that is highly similar to arcO
into one that is highly similar or identical to an HIV-1
subsequence. Reducing the number of steps is useful
because, for each change in target, we must: a) construct
plasmids bearing selectable genes that include the target
sequence in the promoter region, b) construct a variegated
population of ped genes, and c) select cells transformed
with plasmids carrying the variegated population of ped
genes for the DBP+ phenotype.

In sections (a), (c), (e), and (g) of Table 502,
bases in the targets are in upper case if they match HIV
1016-1046 and are underscored if they match the wild-type
arcO sequence.

We construct a series of plasmids, each plasmid
containing one of the target sequences in the promoter
region of each of the selectable genes. For each target,
we variegate the ped gene and select cells for phenotypes
dependent on functional DBPs. For each target, several
rounds of variegation and selection may be required. We
anticipate that a plurality of proteins will be obtained
from independent isolates by selection for binding to one
target. We pick the protein that shows the strongest in
vitro binding to short DNA segments containing the target
as the parental Ped for the next round of variegation and
selection. Genetic methods, such as generation of point
mutations in the ped gene or in the target and selection
for function or non-function of Ped, can be used to determine
associations between particular bases and particular
residues (VERS86b).

Once a Ped with specific binding for the target is
obtained, it may be useful to determine a 3D structure of
the Ped-DNA complex by X-ray diffraction or other suitable
means. Such a structure would provide great help in
choosing residues to vary to improve binding to a given
target or to an altered target.

We initiate development of a polypeptide extension
DBP having affinity for HIV 1016-1037 by generating a
variegated population of Peds and selecting for binding to
the first target. Table 502a shows the first target, which
we designed to have identity to arcO in the left half but
to have a mismatch (arcO vs. target) at A1038 (which is C
in the corresponding position in the right half of arcO and
is palindromically related to a G in the left half); the
rationale is as follows. Vershon et al. (VERS87b) report
that chemical modification with dimethyl sulfate of the
wild-type CG at this location interferes mildly with
binding of Arc and that this location is strongly protected
from modification by dimethyl sulfate if Arc is bound to the
operator. Thus we expect a mismatch between wild-type arcO
and the first target at A1038 to make wild-type Arc bind
poorly. Binding can be restored, however, by favorable
contacts to bases 1021-1023 by the amino-terminal extension.






An alternative first target would have C1038, as does
arcO at the corresponding location, and a base at 1041
that differs from both arcO and HIV-1. Vershon et al.
(VERS87b) report that methylation
of the corresponding CG base pair strongly interferes with
binding of Arc. Thus, changing the base that corresponds
to HIV 1041 should have a strong effect on binding of Arc
to the alternative target.

In the first variegation step, we extend Arc by five
variegated residues at the amino terminus. Since five
residues can contact no more than three bases in a
sequence-specific manner, we limit the extent of the target
to those bases that correspond to HIV 1021-1044. Inclusion
of bases corresponding to HIV 1016-1020 at this initial
stage might position the target too far downstream from the
promoters of the selectable genes to allow strong repression
of these promoters. Once a Ped displaying binding to
bases corresponding to 1021-1044 has been isolated, we can
introduce a greater length of the HIV-1 sequence into the
left side of the target without concern that the Ped will
bind too far downstream from the promoter of the selectable
genes to block transcription. Furthermore, once binding by
the amino-terminal extension has been established, we can,
in a stepwise manner, remove the right half of arcO from
the target, thereby forcing more asymmetric binding to the
left half of arcO and the bases upstream of 1024.


The first target is engineered into both selectable
genes as in Example 2. We use olig#501 and olig#502, shown
in Table 503, to introduce the first target downstream of
Pneo that promotes tet, replacing arcO in pEP2002; the
resulting plasmid is called pEP5000. From pEP5000, we use
olig#503 and olig#504 to construct pEP5010, in which the
first target replaces arcO downstream of Pamp that promotes
galT,K.

Table 502b shows schematically how the amino-terminal
residues align to the first target; the five-residue
extension is unlikely to contact more than 3 base pairs
upstream from base 1024. The alteration in the right half
operator prevents tight binding unless the additional
residues make favorable interactions upstream of 1024.
Care is taken in designing the two instances of the target
that the downstream boundaries are different: AAG in Pneo
and CGT in Pamp. Thus, for the novel DBP to bind specifically
to both instances of the target, it must recognize
the common sequence upstream of base 1024.

An initial variegated ped is constructed using
olig#605, as shown in Table 504, and comprises: a) a
methionine codon to initiate translation, b) five variegated
codons that each allow all twenty possible amino
acids, and c) the Arc sequence from 101 to 157. (Because
we are constructing a polypeptide extension at the amino
terminus, we have added 100 to the residue numbers within
Arc so that Arc residue 1 is designated 101.) This variegated
segment of DNA comprises (2^5)^5 = 2^25 = approx. 3.4 x 10^7
different DNA sequences and encodes 20^5 = 3.2 x 10^6
different protein sequences; with the given technical
capabilities, we can detect each of the possible protein
sequences. The 3' terminal 20 bases of olig#605 are
palindromically related so that each synthetic oligonucleotide
primes itself for extension with Klenow enzyme.
The DNA is then digested with Bsu36I and BstEII and is
ligated to the backbone of appropriately digested pEP5010,
which bears the first target in each selectable gene.
Transformed delta4 cells are selected for Fus^R Gal^R at low,
medium, and high concentrations of IPTG, the inducer of the
lacUV5 promoter that regulates ped. Because the first
target is quite similar to arcO, we anticipate that a
functional Ped will be isolated with low-level induction of
the ped gene with IPTG.
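
The library sizes quoted for this first variegation are simple
powers. A minimal sketch of the arithmetic follows, assuming
that each variegated codon is drawn from 32 = 2^5 codons that
together encode all twenty amino acids; the function names are
illustrative and do not come from the source.

```python
def dna_diversity(codons_per_position, n_positions):
    """Number of distinct DNA sequences in the variegated segment."""
    return codons_per_position ** n_positions

def protein_diversity(residues_per_position, n_positions):
    """Number of distinct protein sequences encoded by the segment."""
    return residues_per_position ** n_positions

# Five variegated codons, 32 codons and 20 amino acids per position.
n_dna = dna_diversity(32, 5)
n_protein = protein_diversity(20, 5)
print(f"{n_dna:,} DNA sequences, {n_protein:,} protein sequences")
```

Because the DNA count exceeds the protein count by roughly a
factor of ten, several DNA sequences encode each protein
sequence on average.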






More than one round of variegation and selection may
be required to obtain a Ped with sufficient affinity and
specificity for the first target. Function of a Ped is
judged in comparison to the protection afforded by wild-type
Arc in cells bearing pEP2002. Specifically, strength
of Ped binding is measured by the IPTG concentration at
which 50% of cells survive selection with a constant
concentration of galactose or fusaric acid, chosen as a
standard for this purpose. A Ped is deemed acceptable if
it can protect cells against the standard concentrations of
galactose and fusaric acid, administered in separate
tests, with an IPTG concentration of 5 x 10^-4 M. Preferably,
a Ped can protect cells against the standard concentrations
of galactose and fusaric acid, tested separately,
with no more than ten times the concentration of IPTG
needed by pEP2002-bearing cells. Variegation of residues
101, 102, and others may be needed. We anticipate that a
plurality of independent functional Peds will be isolated;
we discriminate among these by measuring in vitro binding
to DNA oligonucleotides that contain the target sequence.
The amino-acid sequences of different isolates are compared;
residues that always contain only one or a few kinds
of amino acids are likely to be involved in sequence-specific
DNA binding. Table 505 shows a hypothetical
isolate, Ped-6, that binds the first target.
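
The comparison of isolates described above amounts to counting,
at each aligned position, how many different amino acids occur.
The sketch below shows one way this could be done; the five-
residue sequences are invented placeholders, not data from any
selection, and the threshold of two kinds is an arbitrary
assumption.

```python
def residue_variety(isolates):
    """For aligned, equal-length sequences, return the number of
    distinct amino acids observed at each position."""
    if len({len(s) for s in isolates}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return [len(set(column)) for column in zip(*isolates)]

def conserved_positions(isolates, max_kinds=2):
    """Zero-based positions showing at most max_kinds amino acids."""
    variety = residue_variety(isolates)
    return [i for i, kinds in enumerate(variety) if kinds <= max_kinds]

# Hypothetical extension sequences from four independent isolates.
isolates = ["RKTSA", "RQTLA", "RKTGV", "RKTDA"]
print(residue_variety(isolates))
print(conserved_positions(isolates))
```

Positions that tolerate many different residues (here the fourth)
are weaker candidates for sequence-specific contacts than those
that are invariant among functional isolates.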

Table 502c shows the changes between the first target
and the second target. Three changes are made left of
center to make the target more like HIV 1016-1042. Only
15 the change G±o2o~>^ affects a base that is palindromically 
related in arcO. One change is made right of center that
makes the target more like HIV 1016-1042, less like arcO,
and less palindromically symmetric. Furthermore, the
target is shortened on the right by two bases so that
selection isolates proteins that bind asymmetrically to the
left side of the target. Starting with pEP2002, we
introduce, in two genetic engineering steps that use
olig#541, olig#542, olig#543, and olig#544 (Table 506), the
second target (in place of arcO) into the promoter region
of each selectable gene; the resulting plasmid is denoted
pEP5020.

Table 507 shows a variegated sequence that is ligated
into pEP5020 between BstEII and Bsu36I. Variegated codons
are shown in the same way as in Table 204.

Table 502d illustrates that residues 100-110 of Ped-6
contact the bases of the second target that differ from the
first target. Accordingly, residues 1 and 96-99 of Ped are
not variegated in the DNA shown in Table 507; rather,
residues 100-110 are each varied through four possibilities,
always including the amino acid previously present
at that residue. This generates 4^11 = 2^22 = approx. 4 x
10^6 different DNA and protein sequences. Selection of
transformed delta4 cells for Fus^R Gal^R and screening by in
vitro DNA binding yields, by hypothesis, a plasmid coding
on expression for the protein Ped-6-2, illustrated in
Table 508.

An alternative to the variegation shown in Table 507
is one in which we vary residues 101-105, 108, and 110
through eight possibilities each, yielding approx. 2 x 10^6
DNA and protein sequences. These residues, except M101, are
indicated to be in contact with the operator. M101 has
been altered by the attachment of the polypeptide extension
and thus should also be varied. After variegation of the
listed residues and selection, further variegation should
include some variegation of residues 96-103 because changes
in the listed residues may change the context within which
residues 96-103 contact the DNA.

More than one round of variegation and selection may
be required to obtain a Ped having sufficient affinity and
specificity for the second target.

Table 502e shows the changes from the second target to
the third, which comprise: a) inclusion of bases 1018-1020,
b) one change to the left of the 21 bp arcO region, c) two
changes at the center of the arcO region, d) two changes
left of center, and e) removal of bases 1041 and 1042. All
of these changes make the third target less symmetric and
more like HIV 1016-1040. The third target is introduced
into each of the selectable genes in the same manner as the
second target. The resulting plasmid, obtained in two
genetic engineering steps, is denoted pEP5030. Table 502f
shows that residues 96-110 are all potential sites to
alter the specificity and affinity of DBPs derived from
Ped-6-2. Thus, in Table 510, we illustrate a segment of
variegated DNA that comprises 2^20 = approx. 10^6 DNA
sequences and encodes on expression approx. 10^6 protein
sequences, having ten residues varied through two
possibilities and five residues through four possibilities.
The DNA is then digested with BstEII and Bsu36I and ligated
into pEP5030. Transformed delta4 cells are selected for
Fus^R Gal^R. By hypothesis, we isolate a plasmid, denoted
pEP5031, that codes on expression for the protein Ped-6-2-5
shown in Table 509.

Table 502g shows the changes between the third and
fourth targets. The changes are: a) inclusion of bases
1016-1017, b) two changes right of center, and c) removal
of bases 1038-1040. The initial variegation to be selected
using the fourth target consists of an extension of six
residues at the amino terminus of Ped-6-2-5, shown in Table
511. In iterative steps of forced evolution of proteins,
one should not produce a number of different DNA sequences
greater than the number of independent transformants that
one can obtain (about 10^ with current technology) . 
Because there are no residues corresponding to 90-95 in the
parental DBP (Ped-6-2-5), the first variegation and
selection with the fourth target is a non-iterative step
and it is permissible to produce 10^9 DNA sequences and 6.4
x 10^7 protein sequences. In subsequent iterative rounds of
variegation, the number of variants is, preferably,
limited to a fraction, e.g. 10%, of the number of independent
transformants that can be generated and subjected to
selection. A protein, illustrated in Table 512 and denoted
Ped-6-2-5-2, is isolated, by hypothesis, through selection
of a variegated population of transformed cells for Fus^R
Gal^R.
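
The sizes of the variegated populations used with the second,
third, and fourth targets, and the 10% guideline for iterative
rounds, can be checked with a few lines of arithmetic. This is
a sketch only: the figure of 32 codons per fully variegated
position is an assumption carried over from the first library,
and ASSUMED_TRANSFORMANTS is a placeholder, not a value taken
from the text.

```python
from math import prod

def library_size(choices_per_position):
    """Number of distinct sequences when each varied position
    independently takes the given number of alternatives."""
    return prod(choices_per_position)

def within_budget(n_variants, n_transformants, fraction=0.10):
    """True if the library is no larger than the chosen fraction of
    the independent transformants available for selection."""
    return n_variants <= fraction * n_transformants

second_target = library_size([4] * 11)            # eleven residues, 4 ways
third_target = library_size([2] * 10 + [4] * 5)   # ten 2-way, five 4-way
fourth_dna = library_size([32] * 6)               # six fully varied codons
fourth_protein = library_size([20] * 6)           # six residues, 20 ways

# Placeholder only: NOT a figure taken from the text.
ASSUMED_TRANSFORMANTS = 10 ** 8

for name, size in [("second", second_target), ("third", third_target),
                   ("fourth (DNA)", fourth_dna)]:
    ok = within_budget(size, ASSUMED_TRANSFORMANTS)
    print(f"{name}: {size:,} variants, within 10% budget: {ok}")
```

Under this placeholder budget the second and third libraries
satisfy the guideline, while the six-codon extension does not,
which is why it is treated as a non-iterative step.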

Ped-6-2-5-2 binds specifically to HIV 1016-1037 as a
dimer. HIV 1016-1037 has no palindromic symmetry.
Binding to an asymmetric DNA sequence by a dimeric protein
is possible because the Ped-6-2-5-2 dimer has more recognition
elements than wild-type P22 Arc dimer and so can bind
even though nearly half of the right half of arcO has been
removed from the target. Ped-6-2-5-2 is useful as is;
nevertheless, obtaining a monomeric protein may have
advantages, including: a) higher affinity for the target



wo 90/07862 



PCT/US90/00024 




because suboptimal interactions are eliminated, and b)
lower molecular weight. Obtaining a functional monomeric
Ped is easiest if Arc dimers interact in the manner shown
in Table 500b. We use the following steps to isolate a
protein that binds specifically to HIV 1016-1037 as a
monomer.

Ped-6-2-5-2 is the parental DBP from which we derive
the monomeric DBP. The route taken from a palindromically
symmetric arcO sequence to an asymmetric HIV sequence was
designed to select for binding to the left half of the
original arc operator.

Proteins that do not dimerize, but that bind specifically
to the fourth target, can be generated in several
ways. Because the 3D structure of Arc is still unknown, we
cannot use Structure-Directed Mutagenesis to pick residues
to vary to eliminate dimerization. One way to obtain
monomeric proteins is to use diffuse mutagenesis to vary
all residues from 111 to 157 and select for proteins that
can bind the target sequence. Another strategy is to
synthesize the ped gene in such a way that numerous stop
codons are introduced. This causes a population of
progressively truncated proteins to be expressed. Table
513 shows a segment of variegated DNA that spans the BglII
to KpnI sites of the arc gene used throughout this example.
This segment is synthesized with suitable spacer sequences
on the 5' end. The extra "t" at the 3' end allows two such
chains to prime each other for extension with Klenow
enzyme. The ratios of bases in the variegated positions
are picked so that each varied codon causes about 35% of
polypeptides to terminate at that position. Since we
intend to determine how much the protein can be shortened
and remain functional, we begin by replacing codon 153 with
stop. Since 15 residues are varied, only about 0.3% of
chains will continue to stop codon 153 without one or more
stop codons. All the intermediate-length chains will be
present in the selection in detectable amount. delta4
cells transformed with pEP5030 containing this vgDNA are
selected for Fus^R Gal^R. Because each variegated codon
causes translation termination in about 35% of the genes in
the variegated population, shorter coding regions are more
abundant than longer ones. Thus, the shortest gene that
encodes a functional repressor will be the most abundant
gene selected. Plasmid DNA from a number of independent
selected colonies is sequenced. The dimerization properties
of several functional DBPs are tested in vitro and the
sequence of the shortest monomeric protein is retained for
use and further study.
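
The distribution of chain lengths in such a truncation library
follows a geometric pattern. The sketch below assumes that each
of the varied codons independently encodes stop with the same
probability; the function name is illustrative. With exactly
35% per codon and 15 varied codons, the read-through fraction
comes to roughly 0.16%, the same order as the approximate
figure quoted above, which would correspond to a per-codon rate
nearer 32%.

```python
def termination_profile(p_stop, n_codons):
    """Return (ended, read_through): ended[k] is the fraction of chains
    whose first stop falls at varied codon k (0-based); read_through is
    the fraction with no stop at any varied codon."""
    ended = [(1 - p_stop) ** k * p_stop for k in range(n_codons)]
    read_through = (1 - p_stop) ** n_codons
    return ended, read_through

ended, read_through = termination_profile(0.35, 15)
print(f"terminate at first varied codon: {ended[0]:.3f}")
print(f"terminate at last varied codon:  {ended[-1]:.5f}")
print(f"read through all 15 codons:      {read_through:.5f}")
```

Each successive length class is less abundant than the one
before it, which is the basis for expecting the shortest
functional gene to dominate among selected colonies.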

In this manner, we generate a protein that binds
monomerically to a DNA sequence that has no palindromic
symmetry.

Example 6 

We illustrate here the fusion of two known DNA-binding
domains to form a novel DNA-binding protein that recognizes
an asymmetric target sequence. The progression of targets
is the same as shown in Table 502 (Example 5). The amino-acid
sequence of the initial DBP is illustrated in Table
600 and comprises the third zinc-finger domain from the
product of the Drosophila kr gene (ROSE86), a short linker,
and P22 Arc. The linker consists of three residues that
are picked to allow: a) some flexibility between the two
domains, and b) introduction of a KpnI site. The polypeptide
linker should not allow excessive flexibility because
this would reduce the specificity of the DBP.

The primary set of residues to vary to alter the DNA
binding is marked with asterisks. Those in the zinc
finger were picked by reference to the model of Gibson et
al. (GIBS88); all residues having outward-directed side
groups (except those directed upward from the beta strands)
were picked. Residues 101-110 (1-10 of Arc) were also
picked to be in the primary set. Other residues within the
Arc sequence may be varied. For each target in the
progression, we initially choose for variegation residues
in the primary set that are most likely to abut that part
of the target most recently changed. For example, for the
first target, we begin by varying residues 21, 24, 25, 28,
and 29, each through all twenty amino acids. After one or
more rounds of variegation and selection, other residues in
the primary and secondary sets are varied.


Other zinc-finger domains, such as those tabulated by
Gibson et al. (GIBS88), are potential binding domains.
Other proteins with known DNA binding, such as 434 Cro, may
be used in place of Arc. Multiple zinc fingers could be
added, stepwise, to obtain higher levels of specificity and
affinity.
Table 1 MISSENSE MUTATIONS IN λ CRO



• 10 



MEQRITLKDYAMRF 



/ A / A / M alpha 1 I 

beta 1 II 1 I 

I R D (L) 

• ■ s ■ - • 



15 20 25 30 35 

• • o • o 

GQTKTAKDLGVYQSAINKAIHAGR 

(K)R 

(F)C A N 

E P N L R Q (T) L 

R H D H N T (R)K) Q 

l-J l-J-J I l_l I 

-| alp ha 2 I I alpha 3 j_ 

II I I I II 

P T HA L T T 

V F S V 

R T G 

P K 
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40 45 50 55 60 65 

KIFLTINADGSVYAEEVKPFPSNKKTTA 

G R N T 

Y A A Q V S 

T H(E)N (K)K N C L 



/ \ / \ / \ / \ / \ / \ / \ /. 

beta 2 beta 3 

III II 

F F(A) G T 

S V 

L S 
M 



Notes:

Substitutions occurring at solvent-exposed positions in the
unbound repressor dimer are shown above the wild type
sequence.

Substitutions occurring at internal positions are shown below
the wild type sequence.

Substitutions that produce repressor dimers with normal or
nearly normal DNA binding affinities are shown in
parentheses.
Table 2: Examples of selections for plasmid uptake and
maintenance in E. coli

gene       (alternate designation)    function

Amp^R      (Ap^R, bla)                beta-lactamase
Kan^R      (Km^R, neo)                aminoglycoside P-transferase
Tet^R      (Tc^R, tet)                membrane pump
Cam^R      (Cm^R, cat)                acetyl transferase
colicin immunity                      binds to colicin in vivo
TrpA+                                 complementation of trpA



Table 3: Examples of selections for plasmid uptake and
maintenance in S. cerevisiae

gene       function

Ura3+      complements ura3 auxotroph
Trp1+      complements trp1 auxotroph
Leu2+      complements leu2 auxotroph
His3+      complements his3 auxotroph
Neo^R      resistance to G418
Table 4 (continued): Agents for Selection
of DBP Binding in E. coli and Relevant Genotypes

Notes:

1) Deletions are strongly preferred over point mutations.

2) Only secA gene need be controlled by DBP.

3) Mutations in crp are highly pleiotropic; some
effects seen in cell wall. crp best used in connection
with selections having intracellular action.

4) Resistance to colicins can arise in several ways;
use of two or more E-colicins discriminates against
other mechanisms. Because colicins do not replicate,
they are preferred over phage for selection. Phage
are useful to verify selection of cells repressing
expression of ompA.

5) Because colicins do not replicate, they are
preferred over phage for selection. Phage are useful
to verify selection of cells repressing expression of
tsx.
Some Recommended Pairs of Selectable
Binding Marker Genes

A) Recommended pairs:

galT,K                       tetA
araP                         pheS
lacZ                         tetA
dctA                         cysK
S£E                          thyA
lamB                         thyA
secA & malE-lacZ fusion      pyrF
tsx                          cysK
dctA                         thyA
galT,K                       pheS
tetA                         thyA
ptsM                         thyA
ompA                         pyrF
btuB                         pyrF
tonA                         galT,K
cir                          cysK
aroP                         lacZ

B) Less preferred pairs:

Pair                                   Reason

tetA      arqP                         Both transport related.
secA & malE-lacZ fusion      lacZ      Both related to lacZ function.
pyrF      thyA                         Both related to thymine.
lamB      galT,K                       Both related to sugar metabolism.
cir       tsx                          Both related to colicin.
ptsM      tetA                         Both transport related.
tonA      ptsM                         Both transport related.
crp       lacZ                         Both related to sugar metabolism.
Table 6 Promoters

A: Correlation between Sequence Homology and Promoter
Strength (MULL84)

Promoter      Homology score    Log K_B k_2

T7 A1         74.0              7.40
T7 A2         73.4              7.20
λ PR          58.6              7.13
lac UV5       59.2              6.94
              59.2              6.30
T7 D          63.9              6.30
              63.9              6.00
Tn10 Pout     56.2              6.71
Tn10 Pin      52.1              6.18
λ PRM         49.7              4.71
              49.7              4.17

Pamp          52.7

Pneo          58.0
Table 6 (continued) Promoters

B: Sequences of some promoters

Name        -35                       -10        +1

T7 A1       GTATTGACTTAAAG TCTAACCTATAGGATACTTAC AGCCA
T7 A2       GTATTGACAACATG AAGTAACATGCAGTAAGATACA AATCG
λ PR        GTGTTGACTATTTT ACCTCTGGCGGTTAGAATGGT TGCA
lac UV5     GGCTTTACACTTTA TGCTTCCGGCTCGTATAATGTG TGGA
T7 D        GCGTTGACTTGATG GGTCTTTATGTGTAGGCTTTA GGTG
Tn10 Pout   GGGCAGAATTGGTA AAGAGAGTCGTG TAAAATA TC GAGT
Tn10 Pin    AGGTGGATACACAT CTTGTCATATGATCAAATGGT TTCG
λ PRM       TGTTAGATATTTAT CCCTTGCGGTGATAGATTTAA CATA
Pamp        ACATTCAAATATGT ATCCGCTCATG AGACAATA AC CCTG
Pneo        GAATTGCCAGCTGG GGCGCCCTCTGGTAAGGTTGG GAAG
Table 7 FUNCTIONAL SUBSTITUTIONS IN HELIX 5 OF λ
REPRESSOR
Table 8: Some Preferred Initial DBPs

λ cI repressor
λ Cro
434 cI repressor
434 Cro
P22 Mnt
P22 Arc
P22 cII repressor
λ cII repressor
λ Xis
λ Int
cAMP Receptor Protein from E. coli
Trp Repressor from E. coli
Kr protein from Drosophila
Transcription Factor IIIA from Xenopus laevis
Lac Repressor from E. coli
Tet Repressor from Tn10
Mu repressor from phage Mu
Yeast MAT-a1-alpha2
Polyoma Large T antigen
SV40 Large T antigen
Adenovirus E1A
Human Transcription Factor SP1 (a zinc finger protein)
Human Transcription Factor AP1 (product of jun)

Table 9, Table 10, and Table 11 have been deleted. 



Table 13. MISSENSE MUTATIONS IN P22 ARC REPRESSOR
THAT PRODUCE AN ARC- PHENOTYPE

^ 10 15 20 25 30 

MKGMSKMPQFNLRWPREVLDLVRKVAEENG 

high yield . ^ ^ ^ 

Q R I C L C 

T K V 
medium yield ; , ^ 

^ A G A F G A 

E 

low yield . 

R LSKWQRS N HITKK 

F w L y C V T 

Y 

undetermined ^ . ^ ^ 

^ G S V 



35 40 45 50 55 

• • . 
RSVNSEIYQRVMESFKKEGRIGA 



high yield 



medium yield




• * 


Y+S 

• • 


A 

low yield


A 


T 

• a 


• 



WFGH AMSPQA G S K P 

li K K D C 
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TABLE 14 



MISSENSE MUTATIONS AT SOLVENT EXPOSED POSITIONS 
OF THE H-T-H REGIONS OF REPRESSOR PROTEINS 

Table 14a
λ Repressor

(S) 

L - (N) 

Y(K)T R C 

35 40 45 50 55 

o e o a o 

HELIX 2 TURN HELIX 3 ■ 

QESVADKMGMGQSGVGALFNGINA 

* _ * * _ . . ,* „ . ,,. it * 

S P E Y L D D DDK 

K L V 

L S 



Table 14b λ Cro

F K 
K R T 

15 . 20 25 30 35 

« • • • • 

HELIX 2 TURN HELIX 3 

GQTKTAKDLGVYQSA IN K A I H A GR K 
« ft ftftft 'ftft ft -ft ft 

R H D H N N Q T 

E P N L R T L 

C A Q 
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Table 14c 434 Repressor 

H 
L 
V 
T 
A 

20 25 30 35 
* • • • 
HELIX 2 TURN HELIX 3 

QAELAQKVGTTQQSIEQLEN 

A 
H 
L 
S 
M 
R 
P 
K 



Table 14d Trp Repressor 
T 

70 75 80 85 

HELIX 2 TURN HELIX 3

QRELKNELGAGIATITRGSN 

S M C 

D H 
Table 14, NOTES:

Positions in wild type repressors believed to contact DNA are
indicated by a * below the wild type residue.

Substitutions that greatly decrease repressor binding to DNA
are shown below the wild type sequence.

Substitutions that produce repressors with normal or nearly
normal DNA binding affinities are shown above the wild type
sequence.

Substitutions that increase repressor affinity for DNA are
shown in parentheses above the wild type sequence.



Table 15: deleted.
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Table 16: Genetic Code Table 
With Secondary-Structure 
Preferences 



                        Second Base
First                                                Third
base     T          C          A          G          base

T        F  b/a     S  a/b     Y  b       C  b       T
T        F  b/a     S  a/b     Y  b       C  b       C
T        L  a/b     S  a/b     stop       stop       A
T        L  a/b     S  a/b     stop       W  b/a     G

C        L  a/b     P  -       H  a/b     R  b/a     T
C        L  a/b     P  -       H  a/b     R  b/a     C
C        L  a/b     P  -       Q  b/a     R  b/a     A
C        L  a/b     P  -       Q  b/a     R  b/a     G

A        I  b       T  b       N  a/b     S  a/b     T
A        I  b       T  b       N  a/b     S  a/b     C
A        I  b       T  b       K  a/b     R  b/a     A
A        M  b       T  b       K  a/b     R  b/a     G

G        V  b       A  a       D  a/b     G  b/a     T
G        V  b       A  a       D  a/b     G  b/a     C
G        V  b       A  a       E  a       G  b/a     A
G        V  b       A  a       E  a       G  b/a     G



Amino acids denoted "b" strongly favor extended structures. 
Amino acids denoted "b/a" favor extended structures. 
Amino acids denoted "a/b" strongly favor helical structures 
Amino acids denoted "a" very strongly favor helices. 
Proline is denoted "-" and favors neither beta sheets nor 
helices. 



b:    I, M, V, T, Y, C
b/a:  F, Q, R, G, W
a/b:  L, S, H, N, K, D
a:    A, E
-:    P
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' Table 17 

Fraction of DNA molecules having
n non-parental bases when
reagents are used that have fraction
M of parental nucleotide.

Number of bases using mixed reagents is 30.

M      .9965      .97716     .92612     .8577      .79433     .63096

f0     .9000      .5000      .1000      .0100      .0010      .000001
f1     .09499     .35061     .2393      .04977     .00777     .0000175
f2     .00485     .1188      .2768      .1197      .0292      .000149
f3     .00016     .0259      .2061      .1854      .0705      .000812
f4     .000004    .00409     .1110      .2077      .1232      .003207
f8     0.         2x10^-7    .00096     .0336      .1182      .080165
f16    0.         0.         0.         5x10^-7    .00006     .027281
f23    0.         0.         0.         0.         0.         .0000089

most   0          0          2          5          7          12

fn is the fraction of all synthetic DNA molecules having n
non-parental bases.
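
The fractions in Table 17 behave like a binomial distribution over the 30 mixed-reagent positions. The following Python sketch assumes that each position independently keeps the parental base with probability M; the function name is illustrative and does not come from the original text.

```python
from math import comb

def fraction_with_n_changes(m, n, length=30):
    """Binomial probability that exactly n of `length` positions carry a
    non-parental base when each position is parental with probability m."""
    return comb(length, n) * (1.0 - m) ** n * m ** (length - n)

# Print f0..f4 for each parental fraction M listed in Table 17.
for m in (0.9965, 0.97716, 0.92612, 0.8577, 0.79433, 0.63096):
    row = [fraction_with_n_changes(m, n) for n in range(5)]
    print(m, ["%.5f" % f for f in row])
```

Under this assumption the values of M in the table are those for which f0 = M^30 equals 0.9, 0.5, 0.1, 0.01, 0.001 and 0.000001.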



"most" is the value of n having the highest probability. 
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Table 18: best vgCodon 

Program "Find Optimum vgCodon."
INITIALIZE-MEMORY-OF-ABUNDANCES

DO ( t1 = 0.21 to 0.31 in steps of 0.01 )
. DO ( c1 = 0.13 to 0.23 in steps of 0.01 )
. . DO ( a1 = 0.23 to 0.33 in steps of 0.01 )
Comment calculate g1 from other concentrations
. . . g1 = 1.0 - t1 - c1 - a1
. . . IF( g1 .ge. 0.15 )
. . . . DO ( a2 = 0.37 to 0.50 in steps of 0.01 )
. . . . . DO ( c2 = 0.12 to 0.20 in steps of 0.01 )
Comment Force D+E = R + K
. . . . . . g2 = (g1*a2 - .5*a1*a2)/(c1+0.5*a1)
Comment Calc t2 from other concentrations.
. . . . . . t2 = 1. - a2 - c2 - g2
. . . . . . IF( g2 .gt. 0.1 .and. t2 .gt. 0.1 )
. . . . . . . CALCULATE-ABUNDANCES
. . . . . . . COMPARE-ABUNDANCES-TO-PREVIOUS-ONES
. . . . . . end_IF_block
. . . . . end_DO_loop ! c2
. . . . end_DO_loop ! a2
. . . end_IF_block ! if g1 big enough
. . end_DO_loop ! a1
. end_DO_loop ! c1
end_DO_loop ! t1

WRITE the best distribution and the abundances.
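
The search in Table 18 can be sketched in Python as follows. Several points are assumptions rather than statements from the table: the third codon position is taken as an equimolar T/G mix, CALCULATE-ABUNDANCES is taken to sum codon probabilities per amino acid under the standard genetic code, and COMPARE-ABUNDANCES-TO-PREVIOUS-ONES is taken to keep the distribution with the largest ratio of least-favored to most-favored amino acid. The names `abundances` and `find_optimum_vgcodon` are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in T, C, A, G order; '*' marks a stop codon.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def abundances(p1, p2, p3):
    """Expected fraction of each amino acid (and '*') for base fractions
    p1, p2, p3 at codon positions 1, 2 and 3 (dicts keyed by base)."""
    out = {}
    for codon, aa in CODE.items():
        f = p1[codon[0]] * p2[codon[1]] * p3.get(codon[2], 0.0)
        out[aa] = out.get(aa, 0.0) + f
    return out

def find_optimum_vgcodon():
    """Grid search over the ranges of Table 18 (integer percent steps)."""
    p3 = {"T": 0.5, "G": 0.5}              # assumed third-position mix
    best_score, best = -1.0, None
    for it, ic, ia in product(range(21, 32), range(13, 24), range(23, 34)):
        ig = 100 - it - ic - ia            # g1 from the other fractions
        if ig < 15:
            continue
        t1, c1, a1, g1 = it / 100, ic / 100, ia / 100, ig / 100
        p1 = {"T": t1, "C": c1, "A": a1, "G": g1}
        for ia2, ic2 in product(range(37, 51), range(12, 21)):
            a2, c2 = ia2 / 100, ic2 / 100
            # Force D+E = R+K, as in the pseudocode.
            g2 = (g1 * a2 - 0.5 * a1 * a2) / (c1 + 0.5 * a1)
            t2 = 1.0 - a2 - c2 - g2
            if g2 > 0.1 and t2 > 0.1:
                p2 = {"T": t2, "C": c2, "A": a2, "G": g2}
                ab = abundances(p1, p2, p3)
                amino = [v for k, v in ab.items() if k != "*"]
                score = min(amino) / max(amino)   # assumed criterion
                if score > best_score:
                    best_score, best = score, (p1, p2, ab)
    return best_score, best
```

Calling `find_optimum_vgcodon()` runs the full grid and may take several seconds. As a spot check of `abundances`, first- and second-position fractions (T, C, A, G) of (0.26, 0.18, 0.26, 0.30) and (0.22, 0.16, 0.40, 0.22) give values matching the legible entries of Table 19; that distribution is inferred here from the table and is not quoted from it.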
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Table 19: Abundances obtained 
from optimum vgCodon

Amino                      Amino
acid    Abundance          acid    Abundance

A       4.80%              C       2.86%
D       6.00%              E       6.00%
F       2.86%              G       6.60%
H       3.60%              I       2.86%
K       5.20%              L       6.82%
M       2.86%              N       5.20%
P       2.88%              Q       3.60%
R       6.82%              S       7.02% mfaa
T       4.16%              V       6.60%
W       2.86% lfaa         Y       5.20%
stop    5.20%

lfaa = least-favored amino acid
mfaa = most-favored amino acid

ratio = Abun(W)/Abun(S) = 0.4074

j    (1/ratio)^j    (ratio)^j       stop-free

1    2.454          .4074           .9480
2    6.025          .1660           .8987
3    14.788         .0676           .8520
4    36.298         .0275           .8077
5    89.095         .0112           .7657
6    218.7          4.57 x 10^-3    .7258
7    536.8          1.86 x 10^-3    .6881
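
The lower part of Table 19 can be recomputed from two numbers in the upper part. This sketch assumes that the "stop-free" column is the probability that none of j variegated codons is a stop codon, i.e. (1 - 0.052)^j; the variable and function names are illustrative.

```python
ratio = 0.4074          # Abun(W)/Abun(S), from Table 19
stop_fraction = 0.052   # abundance of stop codons (5.20%), from Table 19

def summary_row(j):
    """Return (j, (1/ratio)^j, ratio^j, stop-free fraction) for j codons."""
    return (j, (1.0 / ratio) ** j, ratio ** j, (1.0 - stop_fraction) ** j)

for j in range(1, 8):
    print("%d  %9.3f  %.3e  %.4f" % summary_row(j))
```

Read this way, (1/ratio)^j is the factor by which the most-favored sequence outnumbers the least-favored one when j codons are variegated together.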
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Table 20: Calculate worst codon. 
Program "Find worst vgCodon within Serr of given distribution."

INITIALIZE-MEMORY-OF-ABUNDANCES

READ Serr
Comment Serr is % error level.

Comment T1i,C1i,A1i,G1i, T2i,C2i,A2i,G2i, T3i,G3i
Comment are the intended nt-distribution.

READ T1i, C1i, A1i, G1i
READ T2i, C2i, A2i, G2i
READ T3i, G3i

Fdwn = 1.-Serr
Fup = 1.+Serr

DO ( t1 = T1i*Fdwn to T1i*Fup in 7 steps)
. DO ( c1 = C1i*Fdwn to C1i*Fup in 7 steps)
. . DO ( a1 = A1i*Fdwn to A1i*Fup in 7 steps)
. . . g1 = 1. - t1 - c1 - a1
. . . IF( (g1-G1i)/G1i .lt. -Serr)
Comment g1 too far below G1i, push it back
. . . . g1 = G1i*Fdwn
. . . . factor = (1.-g1)/(t1 + c1 + a1)
. . . . t1 = t1*factor
. . . . c1 = c1*factor
. . . . a1 = a1*factor
. . . end_IF_block
. . . IF( (g1-G1i)/G1i .gt. Serr)
Comment g1 too far above G1i, push it back
. . . . g1 = G1i*Fup
. . . . factor = (1.-g1)/(t1 + c1 + a1)
. . . . t1 = t1*factor
. . . . c1 = c1*factor
. . . . a1 = a1*factor
. . . end_IF_block
. . . DO ( a2 = A2i*Fdwn to A2i*Fup in 7 steps)
. . . . DO ( c2 = C2i*Fdwn to C2i*Fup in 7 steps)
. . . . . DO ( g2 = G2i*Fdwn to G2i*Fup in 7 steps)
Comment Calc t2 from other concentrations.
. . . . . . t2 = 1. - a2 - c2 - g2
. . . . . . IF( (t2-T2i)/T2i .lt. -Serr)
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Table 20, continued^ Calculate worst codon* 
Comment t2 too far below T2i, push it back
. . . . . . . t2 = T2i*Fdwn
. . . . . . . factor = (1.-t2)/(a2 + c2 + g2)
. . . . . . . a2 = a2*factor
. . . . . . . c2 = c2*factor
. . . . . . . g2 = g2*factor
. . . . . . end_IF_block
. . . . . . IF( (t2-T2i)/T2i .gt. Serr)
Comment t2 too far above T2i, push it back
. . . . . . . t2 = T2i*Fup
. . . . . . . factor = (1.-t2)/(a2 + c2 + g2)
. . . . . . . a2 = a2*factor
. . . . . . . c2 = c2*factor
. . . . . . . g2 = g2*factor
. . . . . . end_IF_block
. . . . . . IF( g2 .gt. 0.0 .and. t2 .gt. 0.0 )
. . . . . . . t3 = 0.5*(1.-Serr)
. . . . . . . g3 = 1. - t3
. . . . . . . CALCULATE-ABUNDANCES
. . . . . . . COMPARE-ABUNDANCES-TO-PREVIOUS-ONES
. . . . . . . t3 = 0.5
. . . . . . . g3 = 1. - t3
. . . . . . . CALCULATE-ABUNDANCES
. . . . . . . COMPARE-ABUNDANCES-TO-PREVIOUS-ONES
. . . . . . . t3 = 0.5*(1.+Serr)
. . . . . . . g3 = 1. - t3
. . . . . . . CALCULATE-ABUNDANCES
. . . . . . . COMPARE-ABUNDANCES-TO-PREVIOUS-ONES
. . . . . . end_IF_block
. . . . . end_DO_loop ! g2
. . . . end_DO_loop ! c2
. . . end_DO_loop ! a2
. . end_DO_loop ! a1
. end_DO_loop ! c1
end_DO_loop ! t1

WRITE the WORST distribution and the abundances.
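
The "push it back" step of Table 20 can be isolated as a small helper. This is a sketch under the reading that the fourth fraction is computed from the three varied ones, clamped to within Serr of its intended value, and the other three are rescaled so that the four fractions again sum to 1. The name `push_back` and the sample numbers are hypothetical.

```python
def push_back(free, target, serr):
    """free: the three independently varied fractions.
    target: intended value of the dependent fourth fraction.
    Returns (dependent, free), rescaled if the dependent fraction had
    drifted more than serr (relative) away from target."""
    dependent = 1.0 - sum(free)
    low, high = target * (1.0 - serr), target * (1.0 + serr)
    if dependent < low or dependent > high:
        dependent = low if dependent < low else high
        factor = (1.0 - dependent) / sum(free)
        free = [f * factor for f in free]
    return dependent, free

# Example: t1, c1, a1 each pushed 5% high leaves g1 too low.
g1, (t1, c1, a1) = push_back([0.273, 0.189, 0.273], target=0.30, serr=0.05)
print(g1, t1, c1, a1)
```

The same helper applies to the second codon position, with t2 as the dependent fraction and a2, c2, g2 as the varied ones.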
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Table 21: Abundances obtained 
using optimum vgCodon assuming
5% errors

Amino                      Amino
acid    Abundance          acid    Abundance

A       4.59%              C       2.76%
D       5.45%              E
F       2.49% lfaa         G       6.63%
H       3.59%              I       2.71%
K       5.73%              L       6.71%
M       3.00%              N       5.19%
P       3.02%              Q       3.97%
R       7.68% mfaa         S       7.01%
T       4.37%              V       6.00%
W       3.05%              Y       4.77%
stop    5.27%

ratio = Abun(F)/Abun(R) = 0.3248

j    (1/ratio)^j    (ratio)^j       stop-free

1    3.079          .3248           .9473
2    9.481          .1055           .8973
3    29.193         .03425          .8500
4    89.888         .01112          .8052
5    276.78         3.61 x 10^-3    .7527
6    852.22         1.17 x 10^-3    .7225
7    2624.1         3.81 x 10^-4    .6944



Table 22, deleted 
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Tables for Example 1 - 

Table 100: λ OR3 Downstream
of Pamp that promotes galT,K



5. } r.^T I r,GT I TA ^ 1 r^r.n ] r,CC | CTT I PTA i AAT | ACA | TTC | AAA \ .- 
Olig#4 3' att ac e '^"g aaa gat tta tqtlaaq tttt ' 

I TTp ;,T I I APal I - J =1^ 1- 



) T&T I GTA I Trr i GCT I CA T | r, AG | ACA I ATA I ACC | - 
af a cat ^q q rtaa at.H ctc tqt 1;^t tqq - 

1 -10 1- 



I nTT \ ATC I AC r- 1 Gr A I AGG \ G ^*^ j ftT^r I TAG I AGT 1 C 3« := .01ig#3 
qaa tag tqrr not t c -r-- eta tag atc t 5" 
I \ Ot>3 ^ II Xbal I 



WO 90/07862

PCT/US90/00024



Table 101: λ OR3 Downstream
of Pneo that promotes tet



5' I I CCT I GCG [ AAC | CGG | AAT | TGC | CAG | - 
Olig#6 3' ggc cac tta acc tta acq atc- 
I Stui I I -35 L 



I CTG I GGG I CGC | CCT | CTG | GTA | AGG | TTG | - 
qac ccc qcq gga qac cat tec aac - 

I -10 L 



I GGA I TAT I CAC | CGC | AAG | GGA | TA 3 • = Olig#5 

ggg at a atq acq ttc cct att eg a 5 • 
J 5^r3 |HindITl| 
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Table 102: rav gene 
using lacUV5 as promoter

SpeI-BstEII-(BalI-PpuMI-BglII-BamHI-AvaI)

-KpnI-(Trp terminator)-SfiI



5'-ACTAGT CCAGG C TTTACA CTT TATGC TTCCG GCTCG TATAAT GTGT GG
   | SpeI |

AAT TGTGA GCGGA TAACA ATTTC ACAC ! lacUV5

A GGTAACC AGGAGGAAATAAA ! BstEII & Shine-Dalgarno seq.
  | BstEII |

 m   e   q   r   i   t   l   k   d   y   a   m   r
 1   2   3   4   5   6   7   8   9  10  11  12  13
ATG GAA CAA CGC ATA ACC CTA AAG GAC TAC GCG ATG CGC

 f   g   q   t   k   t   a   k   d   l
14  15  16  17  18  19  20  21  22  23
TTT GGC CAA ACC AAG ACA GCG AAG GAC CTA
  | BalI |                  | PpuMI |

 g   v   y   q   s   a   i   n   k   a   i
24  25  26  27  28  29  30  31  32  33  34
GGG GTG TAT CAG AGC GCG ATT AAC AAG GCC ATC

 h   a   g   r   k   i   f   l   t   i   n   a   d
35  36  37  38  39  40  41  42  43  44  45  46  47
CAT GCC GGC CGA AAG ATC TTC CTA ACC ATT AAC GCT GAT
                 | BglII |
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Table 102 , continued 
 g   s   v   y   a   e
48  49  50  51  52  53
GGA TCC GTC TAC GCG GAA
| BamHI |

 e   v   k   p   f   p   s
54  55  56  57  58  59  60
GAG GTA AAG CCC TTC CCG AGT
                   | AvaI |

 n   k   k   t   t   a   .   .   .
61  62  63  64  65  66  67  67  68
AAC AAA AAA ACA ACA GCG TAA TAG TA GGTACC
                                   | KpnI |

agtcta agcccgc ctaatga gcgggct tttttttt ! terminator
GGCCcgactGGCC -3' ! SfiI
Table 103: Catalogue of plasmids

pEP1001   pAA3H with 4.3 kbp deletion of λ, Cla I
          site introduced.

pEP1002   pEP1001 with fd terminator and Spe I, Sfi
          I, Hpa I cloning site distal to galT,K.

pEP1003   pEP1002 with Pi promoter replaced by pBR322
          amp promoter (Pamp) and Or3 upstream of
          galT,K; Pamp and Or3 bounded by Hpa I and
          Xba I and containing Apa I cloning site
          between Hpa I and Pamp.

pEP1004   pKK175-6 with (Pamp, galT,K, fd terminator,
          Spe I, Sfi I cloning site) from pEP1003.

pEP1005   pEP1004 with Tn5 neo promoter (Pneo) and
          Or3 bounded by Stu I and Hind III.

pEP1006   pEP1005 with BamH I site removed by
          site-specific mutation.

pEP1007   pEP1006 with (lacUV5, S.D., rav cloning
          site, trpa terminator).

pEP1008   pEP1007 with N-terminal part of rav gene.

pEP1009   pEP1008 with complete rav gene.

pEP1010   pEP1009 with Or3 replaced by scrambled
          Or3 sequence.

pEP1011   pEP1009 with Or3 sequences replaced with
          the HIV 353-369 Left Symmetrized Target.

pEP1012   pEP1009 with Or3 sequences replaced with
          the HIV 353-369 Right Symmetrized Target.
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Table 103 (continued^ s Catalogu e of plasmids 



pEPllOO to 
PEP1199 

PEP1200 to 
PEP1299 

PEP1301 

PEP1302 
PEP1303 



pEPlOll with rav jj. 



PEP1012 with rav p, 



pEPllOO with rav ^" VF55, 



pEPllOO with rav j^" FW58. 



pSP64 with Tn5 neo 



PEP1304 



pEP13 03 with deletion of Ap resistance 
gene. 



PEP13 05 

PEP1306 

PEP1307 

PEP1400 to 
PEP1499 

pEPlSOO to 
PEP1599 



pEPieOO to 
PEP1699 



PEP2000 



PEP1304 with rav^" VF55o 
PEP1304 with rav L* FW58o 
pEP13 04 with rav . 

PEP1200 series plasmids with HIV 353-369 
substituted for Right Symmetrized Targets. 

pEP14 00 series plasmids containg 

modified rav j^ genes producing Ravj^ proteins 

that complement the rav^" VF55 mutation. 

pEP14 00 series plasmids containg 

modified rav p genes producing Ravp proteins 

that complement the ravL" FW58 mutation. 

pEP1009 with rav replaced by arc . 



PEP2 001 
PEP2002 



pEP2 000 with arc operator in Pneo, tet . 



pEP2001 with arc operator in Pamp . galT. K . 
Table 103 (continued): Catalogue of plasmids

pEP2003       pEP2002 with Target #1 in Pneo, tet.

pEP2004       pEP2003 with Target #1 in Pamp, galT,K.

vg1-pEP2005   pEP2004 with vgDNA (variegation #1 of
              polypeptide).

vg2-pEP2006   pEP2004 with vgDNA (variegation #2 of
              polypeptide).

vg3-pEP2007   pEP2004 with vgDNA (variegation #3 of
              polypeptide).

pEP2010       pEP2002 with Target #2 in Pneo, tet.

pEP2011       pEP2010 with Target #2 in Pamp, galT,K.

vg1-pEP2012   pEP2011 with vgDNA (variegation #1 of
              residues 1-10).

pEP3000       pEP2004 with CI2-arc(1-10) in place of arc.

pEP4000       pEP2002 with Target #3 in Pneo, tet.

pEP4001       pEP4000 with Target #3 in Pamp, galT,K.

pEP4002       pEP4001 with cro-h12 in place of arc.

vg1-pEP1233   pEP4002 with vgDNA (variegation #1 of
              polypeptide segment).
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Table 104: fd terminator 
and multiple cloning site 
to insert after aalT,K 



5' I CGA I AAG I GCT | CCT | TTT | GCA | GCC | TTT | TTT | TTT \ 
01ig#2 = 3' t ttc caa aaa aaa cat egg aaa aaa aaa | 

J fd terminator I 



I ACT I AGT I GAG | TGG | CCC | GAC | TGG | CCG | TTA | AC 3' = 01ig#l 

I tga tea | gtc acc ggg ctg acc ggc aat tag c 5 ■ 
I Spel I J Sfil I I Hoal I 






Table 105: Mutagenic Primer
to Remove BamHI site from pEP1005

; J t I p I V I 1 I w I i I 
I 93| 94] 95| 96| 97 | 98 | 

5' cc|aca|ccc1gtc|ctg|tgg|atc|- 

3 ' qq tat gag cag aac acc taT - 

- 1 1 I y I a I g I r I i 1 

i 99 1 100 I 101 1 102 1 103 1 104] 

I CTG|TAC|GCC|GGA|CGC|ATC|GT 3* pEPlOOS 

Aac atq egg cct qca tag ca 5 ' blig#7 

Bold, upper case bases indicate sites of mutation. 



Table 106: deleted.
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Table 107: Synthesis of lacUVS - Bst EII- Bal ll- 
Kpn I- trpa terminator 

5 ' I CTA I GTC I CAG | GCT | TTA [ CAC [ TTT [ ATG ) CTT | CCG | GCT | - 
Olig#9 = 3* aa gtc cga aat atg aaa tac aaa ggc caa - 
I Spel I I ' -35 I 



CGT I ATA I ATG I TGT I GGA 



/3' = 01ig#8 

ATT I GTG I AGC | GGA | TAA | CAA | TTT | 
gtg tgt tac aca cct taa cac teg cct att gtt aaa - 
[ -10 I I lac operator |_ 

Olig #11 = 3V 



/3' =01ig #10 

j CAC I ACA I GGT I AAC I CAGGAGAGA TCT A TGC | GGT | ACC | - 
gtg tgt cca ttg gtcctctct aga t acg cca tgg- 
I BstEII I I Bglll| I Kpnl | 

Olig #13 =3'/ 

I AGT I CTA I AGC | CCG | CCT | AAT | GAG | CGG | GCT | TTT | TTT | TT - 
tca gat teg ggc gga tta etc gcc cga aaa aaa aa - 
I spacer] trpA terminator |_ 



G I GCC I CGA I C 3' = Olig #12 
c egg g 5 ' 

J sfii L 



Table 108: deleted. 
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Table 109: Synthesis of 
First segment of rav .gene 



5' C I AGG I AGG | TAA | CCA | oaa | aaa \ aat | aaa | - 
iBstEII I 

I ATG I GAA I CAA | CGC | ATA | ACC | CTA | AAG | GAC | TAG | GCG | ATG | CGC | - 



r /3« = 01ig#14 

I TTT I GGC I CAA| ACC | AAG | ACA| GCG | A AG | GAC | CTA | 
Olig#15 = 3" gg ttc tat cac ttc eta aat - 
I Ball I I PpxiMI I 

1 GGG I GTG I TAT | CAG | AGC | GCG | ATT | AXC | AAG | GCC | ATC | 
ccc cac ata ate teg cac taa ttg ttc" caW taa - 



I CAT I GCC I GGC | CGA \ AAG | ATC | TTC | CTG | 
ata egg ccg get ttc tag aaa gac 5 ' 




Table 110: Second segment of rav gene 



|r|k|i|f|llt|i|n|a|d| 
I 38| 39| 40| 41| 42| 43 | 44 | 45] A6\ 47 | 
5 • C I CGA I AAG | ATC | TTC | CTA | ACC | ATT | AAC | GCT | GAT | 
I Bqllll 



|g|s|v|y|a|e|e|v|k|p|f|p|sl 
I 48 1 49 I 50 1 51] 52 I 53 | 54 | 55 1 56 1 57 1 58 | 59 | 60 1 
I GGA I TCC I GTC | TAG | GCG | GAA | GAG | GTA | AAG | CCC | TTC | CCG | AGT | 
I BamHI | [strand overlap I | AvaX | 



In|klk|t|t|a|.|.|.| 
I 61 1 62 I 63 I 64 I 65 1 66 1 67 1 67 1 68 | 

|AAC|AAA|AAA|ACA|ACA|GCG|TAA|TAG|TAG|gta|cca|gtc| t 3 • 

I Kpnl ! 
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Table 111 ; X Or core sequences 
Used to search HIV-l 



                               1234567
Kim Consensus-A             5' CCGCGGG 3'
                            3' GGCGCCC 5'   Kim Consensus-S

Symmetric Consensus-A       5' CCGCCGG 3'
                            3' GGCGGCC 5'   Symm. Consensus-S

OR3A                        5' CCGCAAG 3'
                            3' GGCGTTC 5'   OR3S

OR3A/Symm. Consensus.6      5' CCGCAGG 3'
                            3' GGCGTCC 5'   OR3S/Symm. Cons.2

OR3A/Symm. Consensus.5      5' CCGCCAG 3'
                            3' GGCGGTC 5'   OR3S/Symm. Cons.3
                               7654321
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Table 112: Potential target binding sequences 
having subsequences matching 
six of seven bases 



CCGCGGG Kim consensus-A 
HIV-1 subsequence =ACTTTCCGCtGGGGACT 

353 I 

I 

CCGCAGG 0R3A/consensus . 6 
HIV-1 subsequence =TCTCGaCGCAGGACTCG 

681 t 

I 

CTTGCGG Or3S 
HIV-1 subsec[uence =TTTGACTaGCGGAGGCT 

760 t 
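
The "six of seven" search behind Table 112 can be illustrated with a short Python sketch. It assumes a plain sliding-window comparison on one strand; the function name `scan` and its return format are illustrative, and the sample sequence is the first HIV-1 subsequence listed above.

```python
def scan(sequence, target, min_matches):
    """Slide `target` along `sequence` and return (index, window, matches)
    for every window that matches at least `min_matches` bases."""
    seq = sequence.upper()
    hits = []
    for i in range(len(seq) - len(target) + 1):
        window = seq[i:i + len(target)]
        matches = sum(a == b for a, b in zip(window, target))
        if matches >= min_matches:
            hits.append((i, window, matches))
    return hits

# Kim consensus-A against the HIV-1 subsequence near position 353.
print(scan("ACTTTCCGCtGGGGACT", "CCGCGGG", 6))
```

Lowering `min_matches` to 5 gives the looser search used for Table 113.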
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Table 113s Potential target binding sequences 
having subsequences matching 'five of seven bases 

Symmetric consensus-S CCGGCGG 

HIV-l subsequence GACTTTCCGctGGGGAC 

352 ' I 

OR3S/consensuSo2 CCTGCGG 

HIV--1 subsequence TTTCCaCTGc[GGACTTT 

355 I 

0R3S/symm consensus o 3 CTGGCGG 

HIV-l subsequence TAGCAgTGGCGcCCGAA 

630 \ 

Symmetric consensus-A CCGCCGG 

HIV-1 subsequence CAGTGgCGCCcGAACAG

633 t 

OR3A/symm consensus.5 CCGCCAG

HIV-1 subsequence CAGTGgCGCCcGAACAG

633 [ 

OR3 A/consensus , 6 CCGCAGG 

HIV-l subsequence GACTAaCGgAGGCTAGA 

763 \ 



symm consensus-S CCGGCGG 

HIV-l subsequence GACTAaCGGaGGCTAGA 

763 t 
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Table 113, continued: Potential target binding sequences 
having subsequences matching five of seven bases 



Or3A/syima consensuses CCGCCAG 

HIV-1 sxibsequence GAAGAtgGCCAGTAAAA 
4545 f 

0R3A/Consensus - 6 CCGCAGG 

HIV-1 subsequence ACAGAtaGCAGGTGATG 
5047 t 



OR3A/consensus . 6 CCGCAGG 

HIV-1 subsequence TCCTAtaGCAGGAAGAA 
5965 \ 
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Tables for Example 2 

Table 200: 
P22 arc operator 



P22 arc       5' ATGATAGAAG|C|ACTCTACTAT 3'
Operator      3' TACTATCTTC|G|TGAGATGATA 5'

consensus     5' ATrrTAGArk|s|myTCTAyyAT 3'
of half-      3' TAyyATCTym|s|krAGATrrTA 5'
sites

P22 arc left half operator  = ATrrTAGArk
P22 arc right half operator = myTCTAyyAT
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Table 2 01 P22 Arc gene 

|in|k|g|mis|k| 
|112|3|4|5|6| 

GG I TAA I CCT | ATG | AAG | GGT j ATG | TCT | AAA | - 

|BstE II I 



lia|p|h|f|n|l|r|w|p|r| 
I 7 I 8 I 9 I 10 1 11 1 12 I 13 I 14 I 15 1 16 1 
I ATG I CCT I CAC | TTT | AAC | CTC | AGG | TGG | CCC | CGG | G- 

I BSU36I I I Xma l| 

|e|v|l|d|l|v|r|k|v|a| 
I 17| 18| 19| 20| 21| 22| 23 | 24 | 25| 26| 
I AG I GTC I CTT | GAT | CTT | GTT | CGC | AAG | GTT | GCT | - 
I PpuM 1 1 

|e|e|n|g|r|s|v|nls|e| 
I 27| 28j 29| 30| 3l| 32 | 33 | 34] 35| 36| 
I GAG I GAA I AAC | GGT | CGG ] TCC | GTT | AAC | TCT | G | - 

I Rsr II I 

I Hpa I I 
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Table 201, continued 

-|i|y|n|r|v|m|e|s|f|k| 
I 37| 38| 39| 40| 4l| 42| 43 | 44 | 45l 46| 
AG I ATT I TAT | AAT | CGC | GTT | ATG | GAG | TCG | TTC | AAG | - 
)Bal II I 

|k|e|g|rli|g|a| ,|.|.| 
I 47 1 48 I 49 I 50 1 51 1 52 I 53 I | | | 
I AAA I GAG I GGT I CGT I ATC j GGC I GCA I TAA I TAG I TGA I 

I GGT I ACC I 
I Kpn I I 



Amino acid sequence encoded is identical to wild type P22 
Arc. 



DNA sequence designed for optimal placement of restriction 
sites. 
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Table 202 Synthesis of P22 Arc gene 



5 ' - G I TAA I CCT [ ATG | AAG | GGT [ ATG [ TCT | AAA | - 
3 ' - qa tac ttc cca tac aga ttt - 
|BstE II I 

/=olig#400 

I ATG I CCT I CAC | TTT | AAC | CTC | AGG | TGG | CCC | CGG | - 
tac qqa ata aaa tta gag tec acc ggg gcc 3 '== 



I GAG I GTC I CTT | GAT | CTT | GTT | CGC | AAG [ GTT | GOT | - 
etc cag gaa eta gaa caa gcg ttc caa cga — 

/=olig#401 

I GAG I GAA I AAC | GGT | CGG | TCC | G TT | AAC | TCT | GAG [ - 
etc ctt ttg cca gcc agg c aa ttg aga cte - 

\=olig#406 

/=olig#402 

I ATC I TAT I AAT [ CGC | GTT | ATG | GAG | TCG | TTC | AAG | - 
tag ata tta gcg caa tac etc age aag ttc - 

\=olig#407 

I AAA I GAG I GGT | CGT [ ATC | GGC | GCA | TAA | TAG | TGA | - 
ttt etc cca gea tag ccg cgt att ate act - 

I GGT I AC 3« = olig#403 
c 5" = olig#408 
I Kpn I I 



Number of bases in each oligonucleotide. 



I BSU3 6I I 



olig#405 



400 = 43    401 = 48    402 = 42    403 = 47

405 = 50    406 = 49    407 = 38    408 = 34
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Table 203s HIV-1 Subsequences 
that are similar to one half of 
the Arc Operator 



                                               Number of
                                               mismatches

                     1234567890|0987654321
arcO                =ATrrTAGArk
HIV-1 subsequence   =ATtATAtAATACAGTAGCAAC          2
                1019 |

                     1234567890|0987654321
arcO                =ATrrTAGArk
HIV-1 subsequence   =ATAATAcAGTAGCAACCCTCT          1
                1024 |

                     1234567890|0987654321
arcO                =           myTCTAyyAT
HIV-1 subsequence   =ACAGTAGCAACCCTCTATTgT          1
                           1040 |

                     1234567890|0987654321
arcO                =ATrrTAGArk
HIV-1 subsequence   =ATGATAGgGGGAATTGGAGGT          1
                2387 |

                     1234567890|0987654321
arcO                =ATrrTAGArk
HIV-1 subsequence   =tTGAcAGAAGAAAAAATAAAA          2
                2624 |
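
The mismatch counts in Table 203 can be checked mechanically. The sketch below assumes that the lower-case letters in the half-operator consensus are ambiguity codes with the meanings given in the notes to Table 204 (r = A or G, y = T or C, k = T or G, m = A or C); the function name is illustrative.

```python
# Ambiguity codes, upper-cased; plain bases map to themselves.
AMBIGUITY = {"A": "A", "C": "C", "G": "G", "T": "T",
             "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
             "S": "CG", "W": "AT", "N": "ACGT"}

def mismatches(pattern, segment):
    """Count positions where a base of `segment` is not allowed by the
    ambiguity code at the same position of `pattern`."""
    return sum(base not in AMBIGUITY[code]
               for code, base in zip(pattern.upper(), segment.upper()))

print(mismatches("ATrrTAGArk", "ATtATAtAAT"))   # entry at 1019
print(mismatches("myTCTAyyAT", "CCTCTATTgT"))   # entry at 1040
```

Each call compares one ten-base half-site with the corresponding ten bases of the listed HIV-1 subsequence.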
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Table 204 Synthesis of Pot ntial DBP-1 
vgl for pEP2004 

M K G M S K 
1 2 3 4 5 6 
5 ' - GCCGTACGG | TAA | CCT | ATG | AAG | GGT | ATG | TCT | AAA | - 
|BstE II I 

2 2 2 2 2 2 
M P Q F |I/M|Q/R|D/V|R/I|W/G|D/G| 
7 8 9 10| ll| 12| 13| 14| 15| 16| 
|ATG|CCT|C AC|TTT|ATs|CrG|GwT|AkAlkGGlGrTl- 



|-3« = olig#420 
2 2222 4.2 22 

|Q/L|R/T|F/Y|R/C|W/G| V \ Q | I/M | T/I j R/Q | 
I 17 1 18 1 19 1 20 1 21 1 22 I 23 | 24] 25 1 26 1 
I CWG I AsA I TwT I VGT | kGG | GTG | CAG | AT s | AyC | CrG | 
3 ' - cc cac ate taS tRa aYc - 



2222222222 
I V/I 1 R/I I F/y I D/V I T/I I R/Q | V/I | D/G | V/I | P/Q | 
I 27| 28| 29| 30| 3l| 32 | 33 | 34 | 35 | 36| 
j rTT I AkA I TwT I GwT | AyC | CrG | rTT | GrT \ rTT | CmG | 
Yaa tMt aWa cWa tRa aYc Yaa cYa Yaa qKc - 



I TAA I TAG I TGA | AAC | CTC | AGG | CGTGATCC 
att ate act tta aag tec acactagg -5'=olig#421 
I BSU3 6I I spacer! 
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Table 204, continued: NOTES 

s = equimolar C and G r = equimolar A and G 

w = equimolar A and T k = equimolar T and G 

y = equimolar T and C m = equimolar A and C 

n = equimolar A, C, G, and T

There are 2^24 = (approx.) 1.6 x 10^7 DNA and protein
sequences.

Number of bases in each oligonucleotide.
420 = 86 421 = 73
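
The two-way choices written above the variegated codons of Table 204 follow from the mixed-base codes in these notes. The Python sketch below expands a variegated codon into the amino acids it can encode, assuming the standard genetic code; the names `MIX` and `encoded_amino_acids` are illustrative.

```python
BASES = "TCAG"
# Standard genetic code in T, C, A, G order; '*' marks a stop codon.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

# Lower-case mixed-base codes as defined in the notes to Table 204.
MIX = {"s": "CG", "r": "AG", "w": "AT", "k": "TG",
       "y": "TC", "m": "AC", "n": "ACGT"}

def encoded_amino_acids(codon):
    """Set of amino acids encoded by a codon that may contain
    lower-case mixed-base codes (upper-case letters are fixed bases)."""
    choices = [MIX.get(ch, ch.upper()) for ch in codon]
    return {CODE[a + b + c]
            for a in choices[0] for b in choices[1] for c in choices[2]}

for codon in ("ATs", "CrG", "GwT", "AkA", "kGG"):
    print(codon, sorted(encoded_amino_acids(codon)))

# 24 independent two-way choices give 2**24 distinct sequences.
print(2 ** 24)
```

Because every variegated codon here encodes exactly two amino acids, the number of DNA sequences and the number of protein sequences are the same.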
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Table 205: Result of first variegation 

M K G M S K 
1 2 3 4 5 6 
I ATG I AAG I GGT | ATG | TCT | AAA | - 



MPQFMRDIWG 
7 8 9 10 11 12 13 14 15 16 
I ATG I CCT I CAC | TTT | ATG | CGG | GAT | ATA | TGG [ GGT | - 



QTYCGVQMTR 
17 18 19 2 0 21 22 2 3 24 25 2 6 
I CAG I ACA I TAT | TGT | GGG | GTG | CAG | ATG | ACC | CGG | - 



VIFDIRVGVP 
27 28 29 30 31 32 33 34 35 36 
GTT I ATA I TTT I GAT I ATC I CGG I GTT i GGT I GTT I CCG 
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Table 206 Synthesis of Potential DBP14 
vg2 for PEP2004 

M K G M S K 
1 2 3 4 5 6 
5 ' -GCCGTACGG | TAA \ CCT | ATG | AAG I GGT | ATG ( TCT | AAA | - 
|BstE TT| 

~M P Q F M|T RlQ D|N I|T W|R G|C 
7 8 9 10 11 12 13 14 15 16 
I ATG I CCT I CAC | TTT | AvG | CrA | rAT | AvT | yCG | kGT | - 

1=3' olig#424 
I 

^ V|D 

Q|H T|N y C G N|I Q|R M|T T|N R|C 
17 18 19 20 21 22 23 24 25 26 
I CAw I AmC I TAC | TGC | GGG | rxjT | Pr-fi | ayn [ ay^^ | | - 
3 ' -q ata acq ccc YWa aYc tRc tRa Rca - 
I overlap | 

V|F R|Q F|S D|N I|T Rjl VjC G|R V|D P|R 
27 28 29 30 31 32 33 34 35 36 
I kTT I CrG I TyT I rAT I Aye | AkA | GkT | sGT | GwT | CsG | 
Maa aYc aRa Yta tRo tMt cMa Sea cWa aSc - 

TAA I TAG I TGA | AAC | CTC | AGG | CGTGATCC 
att ate act t ta aaa tec acactaaq^ -5'=olig#423 
I BSU3 6I I spacer- 1 



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Table 206, continued: NOTES 

s = equimolar C and G r = equimolar A and G 

w = equimolar A and T k = equimolar T and G 

y = equimolar T and C m = equimolar A and C 

n = equimolar A, C, G, and T 



2^24 sequences = 1.6 x 10^7 sequences (DNA and protein).



Number of bases in each oligonucleotide. 



424 = 78 423 = 81 
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- Table 207 Result of second selection 



M K G M S K 
1 2 3 4 5 6 
I ATG I AAG I GGT I ATG I TCT I AAA I 



M P Q F M R N I W G 
7 8 9 10 11 12 13 14 15 16 
I ATG I CCT I CAC | TTT \ ATG | CGA | AAT | ATT | TGG | GGT | - 



QTYCGDRMTR 
17 18 19 20 21 22 23 24 25 26 
I CAT I ACC I TAC | TGC | GGG | GAT | CGG | ATG ] ACC | CGT j 



FNSNIRGRVR 
27 28 29 30 31 32 33 34 35 36 
I TTT I AAT I TCT | AAT | ATC | AGA | GGT | CGT | GTT | CGG I 



|taa|tag|tga| 



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Table 208: Third variegation vg3 for pEP2004 

M K G M S K 
1 2 3 4 5 6 
5'- CGTCGCATGG | TAA | CCT | ATG | AAG | GGT | ATG | TCT | AAA | - 
[spacer | BstE II | 

M|K N|D W|S 

M P Q F E|V R T|A I R|P G 
7 8 9 10 11 12 13 14 15 16 
I A TG I CCT I gAC I TTT I rvG I CGg | rmT I ATA \ ysG | gGT | - 

Q|R y|C D|I R|Q 

G|E T H|R C G N|V G|E M T R 
17 18 19 20 21 22 23 24 25 26 
[ srG I ACA I vrT | TGT | GGG | rvT | srG | ATG | ACC | CGC | =olia#325 
olig#327 3'- c tac tag gcg - 
[ overlap | 



FjC S|N I|T g|d r|h 

V|G N R|H N P|L R R|H R V PjL 
27 28 29 30 31 32 33 34 35 36 
I kkT I AAT I mrT | AAT 1 myC | CGG | srT | CGT | GTT | CnT | 
MMa tta KYa eta KRg gtc SYa gca caa gNa- 



TAA I TAG I TGA | AAC | CTC | AGG | CGACCTGGC 
att ate act ttg gaa tec gctggaccg -5 ' 

I BSU36I I 

s = equimolar C and G r = equimolar A and G

w = equimolar A and T k = equimolar T and G

y = equimolar T and C m = equimolar A and C

n = equimolar A, C, G, and T

4^12 = 1.6 x 10^7 protein and DNA sequences
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Table 209: Polypeptide that" 
Binds First Target 



M K G M S K 
1 2 3 4 5 6 
I ATG I AAG I GGT | ATG | TCT | AAA I - 



MPQFV RDI RG 
7 8 9 10 11 12 13 14 15 16 
I ATG I CCT I CAC | TTT | GTG | CGG | GAT | ATA | CGG ( GGT ( ~ 



G T H C G I Q _M ■ T 
17 18 19 20 21 22 23 24 25 26 
I GGG I ACA I CAT I TGT I GGG I ATT I CAG I ATG I ACC I CGC I 



V N R N P R H R V L 
27 28 29 30 31 32 33 34 35 36 

I ATT I AAT I CGT I AAT I CCC I CGG j CAT I CGT I GTT I CTT I 
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Table 210: Variegation for Second Target vgl for pEP2011 

K|T 
A|V 

m|t g|d g|d m|t s|r k|q 

M V|A R|M A|V V|A N|K N|H 
0 1 2 3 4 5 6 
5'- CGTCGCATGG I TAA | CCT | ATG | rvG [ mG [ GnT | rvG | Ars | mAs | - 
I spacer | BstE II | 



F|y 

H|L 

M|K P|Q Q|H I|N V|I 

q|l r|l n|a v|d t|a r d i r g 

7 8 9 10 11 12 13 14 15 16 
I mwG I CnT I mAs | nvT | rvT | CGG | GAT | ATA | CGG | GGT [ - 



/ = olig#460 
GTHCGI) QMTR 
17 18 19 20 21 22 1 23 24 25 26 
I GGG I ACA I CAc j TGC | GGG | ATc \ CAG | ATG | ACC | CGC | 

olig#461 = 3 * - atq acq ccc tag ate tac tag aca - 
I overlap |_ 



VNRNPRHRVL 
27 28 29 30 31 32 33 34 35 36 
I ATT I AAT I CGT | AAT | CCC | CGG | CAT | CGT | GTT | CTT | 
taa tta gca tta qqg qcc qta qca caa qaa 



TAA I TAG I TGA | AAC | CTC | AGG | CGACCTGGC -3 • 
att ate act ttq gag tec qctqgaccg -5 • 
I BSU3 6I I spacer | 
s = equimolar C and G r = equimolar A and G

w = equimolar A and T k = equimolar T and G

y = equimolar T and C m = equimolar A and C

n = equimolar A, C, G, and T

2^24 = 1.6 x 10^7 protein and DNA sequences
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Table 211: Polypeptide Selected 
for Binding to Second Target 



M T R D M K Q 
0 1 2 3 4 5 6 
I ATG I ACG I AGG | GAT | ATG | AAG | CAG | - 



M QNDIRDIRG 
7 8 9 10 11 12 13 14 15 16 
I ATG I CAT I AAC | GAT | ATT | CGG | GAT | ATA [ CGG | GGT | - 



GTHCGIQMTR 
17 18 19 20 21 22 23 24 25 26 
I GGG I ACA I CAc | TGc | GGG | ATc | CAG | ATG ( ACC | CGC | 



VNRNPRHRVL 
27 28 29 30 31 32 33 34 35 36 
I ATT I AAT I CGT | AAT | CCC | CGG | CAT | CGT | GTT | CTT | 
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Tables for Example 3 
Table 300: CI2-arc(l-10) gene 











1 ^ 


1 1 k 


1 t 


I e 


1 w 












I 1 

X 


1 *^ 


I A 


1 cz 
\ *^ 


1 






GG 


|TAA 


CCT 




CTT 1 AAG 


ACT 


1 GAA 


TGG 






1 BstETT 1 




AX ± X 










p 


1 e 


1 1 


1 V 


g 1 


k 1 s 


V 


e 


e 1 




7 


1 ^ 

8 


1 9 


10 


1 T 1 


1 XJ 


14 


15 


16 1 




CCT 


GAG 


1 CTT 


GTT 






GTC 


GAG 


GAAj 




-a 


1 k 


1 k 


V 


i 


1 1 q 


d 


k 


P 1 


e 1 


17 


1 18 


1 19 


20 


21| 


22 1 23 


24 


25 


26 j 


27 1 


GCT 


AAG 


1 AAA 1 GTT 


ATT* 1 




GAT 


AAA 


CCT| 


GAG| 












JrS u X 






BSU3 6II 




a 


1 q 


i 1 


i I 


V 1 1 


P 


V 


g 1 






28 


1 29| 


30 


31| 


32 1 33 


34 


35 


36 1 






GCC 


CAA 


ATC 


ATA| 


GTA 1 CTT 


CCG 


GTT 


GGC| 












-L 


Sea 1| 












i 


V 1 


t 1 


m 1 


e ( y 


T 


1 


d 1 




37 


38 


39| 


40| 


41| 


42 1 43 


44 


45 


46 i 




ACT 


ATT 


GTTj 


ACCj 


ATG| 


GAG 1 TAT 


CGT 


ATT 


GAC| 










1 Nco I 


a 
















i StY I| 










r 1 


vj 


r 1 


1 1 


f I 


V 1 d 1 


k 1 


1 1 


d I 




471 


48 


49| 


50 1 


5l| 


52 1 53 1 


54 1 


55 1 


56 1 




CGC| 


GTT 


CGT| 


CTT| 


TTT| 


GTC 1 GAC 1 


AAA| 


TTG| 


GATj 














Acc I 


















|Hind II 1 


















i 


Sal I 1 











(continued on next page) 
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|n|i|ale|v|p|r|v|g|g| 
I 57| 58| 59| 60| .6l| 62 | 63 | 64 | 65 | 66 | 
I AAC I ATT I GCT | GAG | GTC j CCT | CGC | GTA | GGT | GGC | 

I Dra II I 
I ppuM I I 
I Pss I I 
I Avail I 



|k|ia|k|g|in|s|k|in|p|q| 
I 67| 68| 69| 70| 7l| 72 | 73 | 74 | 75 | 76| 
I AAA I ATG I AAA | GGT j ATG | TCT | AAG | ATG | CCG | CAA | 

I f I - I • I . I 

I 77| 78| 79| 80] 

I TTT I TAA I TGA | TAG | GGT | ACC | 

| ASP718 I 
I Kpn I I 



Residue M1 is inserted so that translation can initiate.

Residue L2 corresponds to residue L20 of Barley chymotrypsin
inhibitor CI-2.

Residues G66 and K67 are inserted to allow flexibility
between CI-2 and the DNA-binding tail.

Residues 68-77 have the same sequence as the first ten 
residues of P22 Arc. 
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Table 301: Synthesis of 
CI2-arc(l-10) gene 

|Bi|l|k|t|e|w| 
I 1 I 2 I 3 I 4 1 5 I 6 I 
5 ' -G I TAA I CCT | ATG | CTT ( AAG | ACT | GAA | TGG | 
3 '- qa -fcac aaa ttc ±.aa ctt acc 
I BstEIll I Afl I I 

/3 ' = olig#470 

P|e|l|v|g|kls| v|e| e | 

7 I 8 I 9 I 10|:ll|,12| 13| 14| 15| 16 | 

CCT I GAG I CTT | GTT | GGT | AAA | TCT \ GTC | GAG | GAA | 

qga etc aaa caa cca ttt aaa cag etc ct.ii 



|a| lclk|v|i|l|q|d|k I paI e 

1 17| 18| 19| 20| 211 22| 23 | 24 i 25 | 26| 27 

iGCTj AAG I AAA | GTT | ATC | C TG | CAG | GAT | AAA | CCT | GAG 

_saa ttc tt t caa tag aac ate eta ttt aaa cte 

^' t ! Pst T I I BSU3 6I 

olig#475 



/3 • = olig#471 
|a|qlili|v| l|p|vlg I 

I 28| 29| 30] 3li 32| 33 | 34 | 35| 36 ] 

I GCC I CAA I ATC | ATA | GTA | CTT | CCG | GTT | GG C|- 

cgg a tt tag tat cat gaa aac caa cc g - 

I Sea I I \ 5. oiig#476 

I t I i I V I t I m I e I y I r I i I d ! 
I 37| 38| 39| 40| 4l| 42 | 43] 44| 45 | 46} 
I ACT I ATT I GTT | ACC | ATG | GAG | TAT ] CGT | ATT | GAC | - 
tqa taa caa tgg tac etc ata qca taa eta 

I Nco 1 1 5' olig#477 f 

I Sty 1 1 



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Table 301, continued 

/ 3' olig#472 
|r|v|r|l| f|v|d|kll|d| 
I 47| 48| 49| 50| 51 1 52 1 53 | 54 | 55 1 56| 

I CGC I GTT I CGT | CTT [ TTT | GTC | GAC | AAA | TTG | GAT | 

qcq caa aca aaa aaa cag eta ttt aac eta 

I Acc I I 
I Hind II I 
I Sal I I 



olig#473 3* 4 

|n|i|a|e|v|p| r|v|g|g| 

I 57 1 58 I 59 I 60 1 61 1 62 I 63| 64| 65| 66| 

I AAC I ATT I GCT | GAG | GTC | CCT | CGC | GTA | GGT | GGC | 

ttg taa caa etc cag gga acq cat cea ccg - 

I Dra li I I t 3' olig#479 

I PPUM I I I 

I Pss I I I = 5' olig#478 
I Avail I 



|Jc|m|k|g|in|s|k|in|p|q| 
I 67| 68| 69| 70| 7l| 72 | 73 j 74 | 75 | 76| 
I AAA I ATG I AAA | GGT | ATG | TCT [ AAG | ATG | CCG | CAA | - 
ttt tac ttt cca tac aaa ttc tac aac att - 



I f I . .1 . I . I 
I 77 I 78 I 79 I 80 1 

I TTT I TAA I TGA I TAG 1 GGT I AC- 3'= olig#474 
aaa att act ate e - 5 ' 

I Kpn I I 
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Table 301, continued 

Number of bases in each oligonucleotide. 



olig#470   46        olig#475   53
olig#471   57        olig#476   56
olig#472   54        olig#477   31
olig#473   48        olig#478   48
olig#474   47        olig#479   55
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Table 302: Variegation of Tail on CI-2 vgl for pEP3000 

|e|v|p|r|v|g|g| 
5' I 60| 61| 62 I 63 I 64 | 65 1 66 1 

caa ate ggc | GAG [ GTC | CCT | CGC | GTA | GGT I GGC | 

I spacer | PptiM I | 

3 • 

|k|m|k|g|iii|s|k|in|p|q| 
I 67| 68| 69| 70| 7l| 72 | 73 | 74 | 75 | 76| 
I AAA I ATG I AAA | GGT | ATG | AGC | AAG | ATG | CCG | CAG | - 

I f |I/M|Q/R|D/V|R/G| G I V I 
77| 78| 79| SOj 8l| 82 | 83 | 
I TTC I ATs I CrG | GwT | sGA | GGT | GTC | - 
olig#481 3'- ct cca caa - 

/ • 3 • = olig#480 

I Q/L I R/T I F/Y I R/C | W/G | V/D | Q/R | I/M | T/1 j R/Q | 
I 84| 85| 86| 87| 88| 89 | 90| 91] 92 | 93 | 
I C wG I ASA I TwT I yGT | kGG | GwC | CrG | ATs | AyC | CrG | - 
g Wc tSt aWa Rca incc cWc aYc taS tRg aYc- 

I V/I I R/I I F/Y I D/V I T/I I R/Q | V/I ] D/G | V/I | P/Q | 
I 94 I 95 I 96 I 97] 98 | 99 | ICQ | 101 1 102 ] 103 | 
I rTT I AkA I TwT | GwT | AyC | CrG [ rTT | GrT | rTT j CmG | 
Yaa tMt aWa cWa tRg gYc Yaa cYa Yaa gKc- 

, I I • I • I 
I 104 I 105 I 106 1 
I TAA I TGA I TAG | GGT | AC 
att act ate c - 5 ' 

I Kpn I I 

2^24 DNA and protein sequences = (approx) 1.6 x 10^7.
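The library size quoted above can be checked by expanding each variegated codon and multiplying the number of alternatives. The following is a minimal sketch only: the codon list is transcribed from the variegated positions of Table 302 as they appear in this text (residues 78-81 and 84-103), the ambiguity codes follow the legend used in these tables, and the helper names `expand` and `library_size` are hypothetical. With 24 two-way positions it gives 2^24 (about 1.7 x 10^7) for both DNA and protein, which is of the same order as the approximate figure in the table.

```python
from itertools import product

# Ambiguity codes as defined in the table legends (lower case = mixed position).
AMBIG = {"s": "CG", "r": "AG", "w": "AT", "k": "TG",
         "y": "TC", "m": "AC", "n": "ACGT"}

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def expand(codon):
    """Return every concrete codon encoded by a (possibly ambiguous) codon."""
    choices = [AMBIG.get(ch, ch.upper()) for ch in codon]
    return ["".join(p) for p in product(*choices)]


def library_size(codons):
    """Return (number of DNA sequences, number of protein sequences)."""
    n_dna = n_protein = 1
    for codon in codons:
        concrete = expand(codon)
        n_dna *= len(concrete)
        n_protein *= len({CODE[c] for c in concrete})
    return n_dna, n_protein


# Variegated codons of Table 302, as read from this text (an assumption).
TABLE_302 = ("ATs CrG GwT sGA "
             "CwG AsA TwT yGT kGG GwC CrG ATs AyC CrG "
             "rTT AkA TwT GwT AyC CrG rTT GrT rTT CmG").split()

n_dna, n_protein = library_size(TABLE_302)
print(n_dna, n_protein)
```

Because every variegated codon here encodes two different amino acids, the DNA and protein counts coincide, which is why the table gives a single figure for both.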

Number of bases in each oligonucleotide. 
480 ... 82 481 ... 78 
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Tables for Example 4 
Table 400: Search for λ Cro Half Site in
HIV non-variable regions.

Gene seq id=HIVHXB2CG

Sequences sought

                                             OR3/consensus
          Consensus        OR3               hybrid

forward   5' TATCACC 3'    5' TATCCCT 3'     5' TATCACT 3'

reverse   5' GGTGATA 3'    5' AGGGATA 3'     5' AGTGATA 3'

Match with OR3

matches         = TATCCCT
HIV subsequence = aATCtCTAGCAGTGGCG
                  624 ↑

Match with OR3/consensus hybrid

matches         = TATCACT
HIV subsequence = aATCtCTAGCAGTGGCG
                  624 ↑

Match with Consensus

matches         = GGTGATA
HIV subsequence = ACAGATGGCAGGTGATg
                  5057 ↑
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Table 400, continued 

Match with Consensus

matches         = TATCACC
HIV subsequence = cATCtCCTATGGCAGGA
                  5961 ↑

First target : TATCCCTAGCAGTGGCG

Second target: aATCtCTAGCAGTGGCG
               624 ↑
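The kind of search summarized in Table 400 can be sketched as a scan of both strands for a half-site, tolerating a small number of mismatches. This is an illustrative sketch only, not the program used to prepare the table: the function names `reverse_complement` and `find_sites` and the `max_mismatches` parameter are assumptions, and offsets are 0-based within whatever fragment is supplied rather than HIV genome coordinates.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_sites(sequence, motif, max_mismatches=1):
    """List (offset, strand, mismatches) for near-matches of motif.

    Strand "+" means the motif itself was found; strand "-" means its
    reverse complement was found in the supplied sequence.
    """
    sequence = sequence.upper()
    hits = []
    for strand, pattern in (("+", motif.upper()),
                            ("-", reverse_complement(motif))):
        for i in range(len(sequence) - len(pattern) + 1):
            window = sequence[i:i + len(pattern)]
            mismatches = sum(a != b for a, b in zip(window, pattern))
            if mismatches <= max_mismatches:
                hits.append((i, strand, mismatches))
    return hits


# Fragment quoted in Table 400 for the match near position 5057.
fragment = "ACAGATGGCAGGTGATG"
hits = find_sites(fragment, "TATCACC", max_mismatches=1)
print(hits)
```

In the fragment quoted for position 5057, the reverse form GGTGATA of the consensus half-site is found with a single mismatch at its last base, which corresponds to the lower-case g in the table.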
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Table 401; X Cro alpha 1 & 2 & slot for Polypeptide 

I «i I e I q I r I i I t I 
|1|2|3|4|5|6| 
5' cqa I CGG | AGG | TAA | C CT | ATG | GAA [ CAA | CGC | ATA | Arr | - 

I Spacer | BstE XT' | , , 



olig#483 3 '4, 
|l|k|d|y la|m|r|f|g|q| 
I 7 I 8 I 9 I 10 1 111 12 1 13 1 14 1 15 1 16 1 
I CTA I AAG I GAC | TAG | GCG | ATG | CGC | TTT | GGP | r A A | 

qc tac acq aaa ccct att - 
|Bal I I 

|tjk|t|a|k|d|l|g|v| 
I 17| 18| 19| 20| 21] 22| 23 | 24 | 25 | 
I ACC I AAG I ACA I GCC I AAA I GAT I CTC I GGG I GTG I 
tqq tt c tat egg ttt eta gag ccc cac - 

|Bql II I 

I Ava 1 1 



i • I • I . I 

I I I I 

I TAG I TAG I TAG I GGT I ACC I AAG I GCG I 
ate at e ate cca tgg ttc cqc - 5' olig#484 
I Kpn T I sanc«»T-| 

Number of bases in each oligonucleotide. 



483 60 



484 ... 65 
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Table 402 Variegated Polypeptide to attach to 
Cro Helices 1, 2, & 3 vgl for pEP4 002 

k I d I 1 I g I V I 
2l| 22| 23| 24| 25 | 

ccq I acq | gcc | cgA | GAT | CTC | GGG | GTG | - 

I spacer | Bal II | 

I Ava 1 1 

|y|q|s|a|i|n|k|a|i|h| 
I 26| 27| 28| 29| 30| 3l| 32 | 33 | 34 | 35 | 
I TAT I CAG I AGC | GCG | ATT | AAC | AAA [ GCG | ATC | CAC | - 

i|M q|r d|v r|i w|g d|g q|l r|t F|y r|c 

I 36| 37| 38| 39| 40| 4l| 42 | 43 | 44 | 45 | 
I ATs I CrG I GwT | AkA [ kGG | GrT | CwG | AsA | TwT | vGT | - 

4- 3' = olig#486 
W|G V Q I|M T|I R|Q V|I R|l F|Y D|V 
I 46| 47| 48| 49| 50| 5l| 52 | 53 | 54 | 55 | 
I kGG I GTG I CAG | AT s | AyC | CrG | rTT | AJcA | TwT | GwT | 
cc cac gac taS vRa aYc Yaa tMt aWa cAa - 

T|I R|Q V|I D|G V|I P|Q 
I 56 1 57 1 58 1 59 1 60 1 61 1 
I AyC I CrG I rTT | GrT | rTT | CmG | 

tRg qYc Yaa cYa Yaa aKc - 




I TAG I TAG I TAG | GGT | ACC | AAG | GCG | 
ate ate ate cca tgg ttc cgc 5' = olig#488 
I Kpn I I sapcerl 

s = equimolar C and G r = equimolar A and G 

w = equimolar A and T k = equimolar T and G 

y = equimolar T and C m = equimolar A and C 

n = equimolar A, C, G, and T 
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Table 403 
Result of first variegation of
alpha 1,2,3; vgPolyPeptide

|  m  |  e  |  q  |  r  |  i  |  t  |
|  1  |  2  |  3  |  4  |  5  |  6  |
| ATG | GAA | CAA | CGC | ATA | ACC |

|  l  |  k  |  d  |  y  |  a  |  m  |  r  |  f  |  g  |  q  |
|  7  |  8  |  9  | 10  | 11  | 12  | 13  | 14  | 15  | 16  |
| CTA | AAG | GAC | TAC | GCG | ATG | CGC | TTT | GGC | CAA |

|  t  |  k  |  t  |  a  |  k  |  d  |  l  |  g  |  v  |  y  |
| 17  | 18  | 19  | 20  | 21  | 22  | 23  | 24  | 25  | 26  |
| ACC | AAG | ACA | GCC | AAG | GAC | CTA | GGC | GTG | TAT |

|  q  |  s  |  a  |  i  |  n  |  k  |  a  |  i  |  h  |
| 27  | 28  | 29  | 30  | 31  | 32  | 33  | 34  | 35  |
| CAG | AGC | GCG | ATT | AAC | AAA | GCG | ATC | CAC |

|  M  |  Q  |  V  |  R  |  G  |  D  |  L  |  T  |  Y  |  C  |
| 36  | 37  | 38  | 39  | 40  | 41  | 42  | 43  | 44  | 45  |
| ATG | CAG | GTT | AGA | GGG | GAT | CTG | ACA | TAT | TGT |

|  W  |  V  |  Q  |  I  |  I  |  R  |  V  |  R  |  F  |  D  |
| 46  | 47  | 48  | 49  | 50  | 51  | 52  | 53  | 54  | 55  |
| TGG | ... | ... | ... | ... | ... | ... | AGA | TTT | GAT |

|  T  |  R  |  V  |  G  |  I  |  Q  |
| 56  | 57  | 58  | 59  | 60  | 61  |
| ACC | CGG | GTT | GGT | ATT | CAG |

|  .  |  .  |  .  |
| TAG | TAG | TAG |
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Tables for Example 5 



Table 500: Proposed binding of Arc dimer to arcO. 

(a) Interaction of residues 1-10
with arcO



Arc N >C C< N 

arcO 5 • ATrrTAGArksmyTCTAyyAT 
3 ' TAyyATCTymskrAGATrrTA 



(b) N-terminal residues interacting with same
polypeptide chain, dimer contacts near C-terminus



2 C CI 

\ / 

/vvvwvvvx 
/ \ 

VVV"\, ,/"VVV 
Arc 1 N I 1 N 2 

arcO 5' ATrrTAGArksmyTCTAyyAT

     3' TAyyATCTymskrAGATrrTA
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Table 500, continued 

(c) N-terminal residues interacting with opposite 
polypeptide chain, dimer contacts close to residue 10. 



2 C CI 

/ VVV V vv\ 

\ 

vvv~xr~vvv 

Arc 1 N / \ N 2 

arcO 5 • ATrrTAGArksmyTCTAyyAT 
3 « TAyyATCTymskrAGATrrTA 
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Table 501 : Search of HIV-1 isolate HXB2 DNA sequence 
for sequences related to one half of arcO 

In arcO sequence, upper case letters represent palindromically
related bases.

In HIV-1 subsequences, @ represents a nucleotide found to
vary among HIV-1 isolates, while lower case letters represent
mismatch to arcO.

HIV-1 1016-1051 is non-variable.

arcO left half    = ATrrTAGArk
HIV-1 subsequence = @@ATCATTATATAATAcAGTAGCAACCCTCTATTGTGT@
                    1024 ↑

arcO right half   = myTCTAyyAT
HIV-1 subsequence = CAGTAGCAACCCTCTATTgTGT@
                    1040 ↑

2387-2427 is non-variable.

arcO left half    = ATrrTAGArk
HIV-1 subsequence = @ATGATAGgGGGAATTGGAGGTTTTATCAAAG
                    2387 ↑

4661-4695 is non-variable.

arcO left half    = ATrrTAGArk
HIV-1 subsequence = AAGTCAAGGAgTAGTAGAATCTATGAATAA@
                    4676 ↑
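Because the arcO half-sites are written with ambiguity codes (r, y, k, m, s), comparing them with a candidate sequence means asking, base by base, whether the observed base is one of those the code allows. A minimal sketch of that comparison follows; the names `mismatches` and `scan` are hypothetical, the 21-base example string is simply one sequence chosen to satisfy the arcO pattern given in Table 500, and any character outside A, C, G, T (such as the @ used above for variable positions) is counted as a mismatch.

```python
# Bases permitted by each ambiguity code used in these tables.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
         "K": "GT", "M": "AC", "N": "ACGT"}


def mismatches(pattern, seq):
    """Count positions where seq is not allowed by the degenerate pattern."""
    if len(pattern) != len(seq):
        raise ValueError("pattern and sequence must have equal length")
    return sum(base.upper() not in IUPAC[code.upper()]
               for code, base in zip(pattern, seq))


def scan(pattern, seq, max_mismatches=0):
    """Return (offset, mismatch count) for each acceptable window of seq."""
    width = len(pattern)
    hits = []
    for i in range(len(seq) - width + 1):
        count = mismatches(pattern, seq[i:i + width])
        if count <= max_mismatches:
            hits.append((i, count))
    return hits


ARCO = "ATrrTAGArksmyTCTAyyAT"          # pattern as written in Table 500
example = "ATGATAGAAGCACTCTACTAT"       # illustrative sequence fitting it
print(mismatches(ARCO, example))
print(scan("ATrrTAGArk", "GGATGATAGAAGCC"))
```

Raising `max_mismatches` reproduces the looser matching used in the table, where lower-case letters in the HIV-1 subsequences mark the positions that disagree with arcO.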
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Table 502: Progression of Targets I^eading to 
HIV-1 1016-1037 

(a) 1016 1024 center 1047 

i ' ^ <t ^ 

HIV-1 5 • ATCATTATATAATAcAGTAGCAACCCTCTATT 
First target 5« TAcATaATAaAaacaCtAtaCTaT @ & & 

P22 sequence 5' attg acATaaTAGAaacacTCTActATattctcaata 3' 
3 ' taactaT ActATCTtcataAGATaaTAtaaaaa ttat 5 ' 
J arcO I 



In target: Upper case indicates that HIV-1 and arcO
           agree.
           Lower case indicates a change to match arcO.
           Underscore indicates identity to arcO.
           @ indicates bases that vary between instances
           of target.

In arcO:   underscore indicates DNase I protected,
           lower case indicates not palindromically
           related.
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Table 502, continued 
Novel DBP = ] 



NXXXXX >C C< XXXXXN 



First: "target = 




In the Novel DBP, X represents variegated sequence. 

In the line between Novel DBP and target DNA:

     represents regions where variegated sequence will
     produce amino acid sequences that will bind
     specifically.

N & C are the amino and carboxy ends of residues 1-10.

| or \ represent regions where constant amino acid
     sequence is known to bind DNA.

#    represents regions where constant amino acid
     sequence is believed not to bind DNA.

$    represents regions where DNA sequence varies
     between different instances of the target.
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Table 502s Progression of Targets Leading to 
HIV-1 1016-1034 
(continued) 



(c) 1016 1024 center 1047 

HIV-l 5» ATCATTATATAATAcAGTAGCAACCCTCTATT 

First target 5« T AcATaATAaAaacaCtAtaCTaT 
changes ^ ^ ^ ^ ^ U 

Second target 5" TAcATAATA CAG qcaCtA tCCT@ @ 

P22 sequence 5« attg acATaaTAGAaacacTCTActATattctcaata 3 
3 ' taactqTActATCTtcgtqAGATgaTAtaagaa ttat 5 
J arcO I 



(d) ^ 

Novel DBP = N XXXXXXXX>C C<XXXXXXXX— N 

Mil Ill I $$$$$$$$$$$$ 

Second target = @TAcATAATACAGgcaCtAtCCT@§@@@§ 
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Table 502, continued 

(e) 1016 1024 center 1047 

HIV-1 5' ATCATTATATAATACAGTAGCAACCCTCTATT 

Second target 5" ^eeTAcAT AATA CAG acaCt AtCCT 

changes ^ ^ ^ ^ ^ 

Third target 5' CATTATAT AATA CAGTAaCAACCg e 

P22 sequence 5' af.tiga ^ATaaTAGAaa cacTCTActATattctcaata 3' 
3 • taactaTActATCTtcataAGATaaTAtaagaa ttat 5 ' 
J arcO L 



(f) diffuse variegation 

Novel DBP = NXXXXXXXXXXXXX>C C<XXXXXXXXXXXXXN 

I I $$$$$$$$$$$$$$$$ 

Third target = @ @ C ATTAT AT AAT AC AGTAa C AACC @@@@e@@e@@ 
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Table 502: Progression of Targets Leading to 
HIV-1 1016-1034 
(continued) 

(g) 1016 1024 center 1047 

HIV-1 5« ATCATTATATAATAcAGTAGCAACCCTCTATT 

Third target 5' @§CATTATATAATACAGTAaCAACC@@@@8§@ 

changes _ ^ ^ ^ 

Fouirth target 5" ATCATTATATAATACAGTAGCA@@§ 

P22 sequence 5' attg acATaaTAGAaqcacTCTActATattctcaata 3' 
3 » taactqTActATCTtcgtgAGATgaTAtaaoaa ttat 5 ' 
J arcO I 



(h) diffuse variegation , _ 

Novel DBP = NXXXX >c C< XXXXN 

1 1 1 1 1 M 1 1 1 M 11$$$$$$$$$$$$$$$$ 

Fourtii target 5« ATCATTATATAATACAGTAGCA@@@@§@@@§§§§@ 
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Table 503: First Target Dovmstreain 
of Promoters of Selectable Genes 

First Target downstream 
of Pamp that promotes galT. K 

5' I CCT I GCG I AAC | CGG [ AAT | TGC | CAG | - 
Glig#501 = 3« gga coc ttcr acc tta acq ate - 
I Stui I I -35 I 



I CTG I GGG I CGC | CCT [ CTG | GTA | AGG | TTG | GGA | - 
qac ccc acq gga aac cat tec aac cot - 

I -10 I 

1024 

I TAG I ATG I ATA | GAA | GCA | CTA | TAC | TAT | A 3« = Olig#502 

atg tac tat ctt cgt gat atg ata t teg a 5 • 
J First Target [ |Hind3 | 
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Table 503^ continued 

First Target downstream 
of Pneo that promotes tet 

5«" I CTT I CTA I AAT | ACA | TTC | AAA | 
Olig#503 3" c caa aaa gat tta tat aag ttt - 

I Apal I I -35 I 



[ TAT I GTA I TCC | GCT | CAT | GAG | ACA | ATA | ACC | CT | - 
ata cat aaa caa ata etc tat tat tag aa - 

I -10 I 

1024 

I TAC I ATG \ ATA | GAA | GCA | CTA | T AG | TAT | CGT 3' = Olig#504 

a-tq tac -ta-t c-tt cat oat: ata ata r gca gat c /, 5 ! r 
J First Target | | Xbal \ • 
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Table 504 First variegated 
insert into ped gene 

|in|x|X|x|X|X| 
5' I 1 I 96| 97| 98] 99|100| 

I cct I cag I cGG | TAA I CCT I ATG | f 2k I f zk I f zk I f 2k I f zk I 
I spacer | BstE TI [ 

■|in|k|g|in|s|kl 
I 101 1 102 I 103 I 104 I 105 I 106 | 
I ATG I AAG I GGT | ATG | TCT | AAA | 

|in|p|h|f|n|l|r| Center of symmetry for 

I 107 I 108 I 109 I 110 I 111 I 112 I 113 | 4, priming. 

I ATG I CCT I CAC | TTT | AAC | CTC | AGG | cgt | att | aat | acg | cct | g-3 ' 

I Bsu36I I I primer I 

olig#605 f * 

Self priming 

CTC I AGG I cgt | att | \ 

3'- tec gca taa / 

3' end self primes for extension with Klenow enzyme, 

f = (0.26 T, 0.18 C, 0.26 A, 0.30 G)
z = (0.22 T, 0.16 C, 0.40 A, 0.22 G)
k = equimolar T and G

There are (2^5)^5 = 3.2 x 10^7 different DNA sequences encoding
20^5 = 3.2 x 10^6 different protein sequences.

100 has been added to residue numbers for wild-type Arc.
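The effect of the biased base mixtures f and z can be made concrete by computing the probability of each amino acid at one "fzk" codon. The sketch below assumes only the base fractions listed in the legend and the standard genetic code; the function name `residue_distribution` is hypothetical, and the resulting probabilities are a property of this calculation rather than values stated in the source.

```python
from collections import defaultdict

# Base fractions from the legend of Table 504.
F = {"T": 0.26, "C": 0.18, "A": 0.26, "G": 0.30}   # first codon position
Z = {"T": 0.22, "C": 0.16, "A": 0.40, "G": 0.22}   # second codon position
K = {"T": 0.50, "G": 0.50}                          # third codon position

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def residue_distribution(first, second, third):
    """Probability of each amino acid ('*' = stop) for one doped codon."""
    dist = defaultdict(float)
    for a, pa in first.items():
        for b, pb in second.items():
            for c, pc in third.items():
                dist[CODE[a + b + c]] += pa * pb * pc
    return dict(dist)


dist = residue_distribution(F, Z, K)
for residue, p in sorted(dist.items(), key=lambda item: -item[1]):
    print(residue, round(p, 4))

# Library sizes for five such codons, as counted in the table.
print((2 ** 5) ** 5, 20 ** 5)
```

With the third base restricted to T or G, all twenty amino acids remain accessible and TAG is the only stop codon that can arise, so the 32 codons per position correspond to 20 residues per position.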
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Table 505s Protein Ped-6 
Selected for Binding to First Target 







|  m  |  k  |  d  |  i  |  w  |  r  |
|  1  | 96  | 97  | 98  | 99  | 100 |
| ATG | AAG | GAT | ATT | TGG | CGT |

|  m  |  k  |  g  |  m  |  s  |  k  |
| 101 | 102 | 103 | 104 | 105 | 106 |
| ATG | AAG | GGT | ATG | TCT | AAA |

|  m  |  p  |  h  |  f  |  n  |  l  |  r  |  w  |  p  |  r  |
| 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 |
| ATG | CCT | CAC | TTT | AAC | CTC | AGG | TGG | CCC | CGG | G
                        |     Bsu36I      |     |   Xma I   |

|  e  |  v  |  l  |  d  |  l  |  v  |  r  |  k  |  v  |  a  |
| 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 |
|  AG | GTC | CTT | GAT | CTT | GTT | CGC | AAG | GTT | GCT |
|   PpuM I    |

|  e  |  e  |  n  |  g  |  r  |  s  |  v  |  n  |  s  |  e  |
| 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 |
| GAG | GAA | AAC | GGT | CGG | TCC | GTT | AAC | TCT |  G  |
                        |   Rsr II    |
                                    |   Hpa I   |

   |  i  |  y  |  n  |  r  |  v  |  m  |  e  |  s  |  f  |  k  |
   | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 |
AG | ATC | TAT | AAT | CGC | GTT | ATG | GAG | TCG | TTC | AAG |
|  Bgl II  |

|  k  |  e  |  g  |  r  |  i  |  g  |  a  |  .  |  .  |  .  |
| 147 | 148 | 149 | 150 | 151 | 152 | 153 |     |     |     |
| AAA | GAG | GGT | CGT | ATC | GGC | GCA | TAA | TAG | TGA |

| GGT | ACC |
|   Kpn I   |
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Table 506: Second Target Downstream 
of Promoters of Selectable Genes 

Second Target downstream 
of Pamp that promotes galT, K 

5 ' I CCT I GCG I AAC | CGG | AAT | TGC | CAG | - 
01ig#541 = 3' qga cac tta acc tta acq ate - 
I Stui I I -35 I 

I CTG I GGG I CGC | CCT [ CTG | GTA | AGG | TTG | GGA [ - 
qac ccc acq qqa aac cat tec aac cct - 

I -10 I 

1024 

I TAG I ATA I ATA | CAG | GCA | CTA | TCC | T | A 3' = 01ig#542 

atq tat -fcat ate cat gat aqg a t tcaa 5' 
I Second Target I | Hind3 | 



Second Target downstream 
of Pneo that promotes tet 

5 • I CTT I CTA I AAT | ACA | TTC | AAA | - 

01ig#543 3" c egg aaa aat tta tat aag ttt - 

I Apal I I -35 I 

[ TAT I GTA I TCC | GCT | CAT | GAG | ACA | ATA | ACC | CT | - 
ata cat agg cga gta etc tgt tat tgg aa - 

I -10 I 

1024 

TAG [ ATA I ATA | CAG | GCA | CTA | TCC | T [ CGT 3' =01ig#544 
atq tat tat gtc cgt gat agg a gca gat c 5 • 
Second Target \_ | Xbal | 
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Table 507 : Variegation for selection with 
Second Target ' 

R|k 

I m I k I d I i I w e|g 
I 1 I 96 1 97 I 98 j 99 I 100 J 
5 ' -caa I eta | CGG | TAA | CCT | ATG I AAA | GAT | ATC | TGG | rrA | - 
I spacer | BstE II j 

***** 
M|r K|q G|d M|i S|r Kjq 
v|g t|p h|r f|l n|k t|p - 
1 101 1 102 I 103 1 104 I 105 1 106 I 
I rkG I mmG | srT | wTk | Ark | mmA \ - 

* * 

M|r P|q H|y F|y . . k . - 

vig r|l s|p v|d n j 1 | r | Center of ^ symmetry . 
1 107 1 108 1 109 1 110 I 111 1 112 1 113 I 4. for priming 

I rkG I CnG I vmT | kw T | AAC | CTC I AGG | ccrt | att | aat [ acq | cct | g -3 

I BSU3 6I I I primer [. 

k = equimolar T and G r = equimolar A and G 

w = equimolar T and A . s = equimolar C and G 

m = equimolar A and C y = equimolar T and C 

Approximately 4 x 10^6 DNA and protein sequences.

* indicates sites of one alternative variegation. 
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Table 508: Protein Ped-6-2 
Selected for Binding to Second Target 



|  m  |  k  |  d  |  i  |  w  |  E  |
|  1  | 96  | 97  | 98  | 99  | 100 |
| ATG | AAG | GAT | ATT | TGG | GAG |

|  R  |  Q  |  G  |  M  |  R  |  T  |
| 101 | 102 | 103 | 104 | 105 | 106 |
| AGG | CAG | GGT | ATG | AGG | ACA |

|  M  |  P  |  Y  |  F  |  n  |  l  |  r  |  w  |  p  |  r  |
| 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 |
| ATG | CCT | TAC | TTT | AAC | CTC | AGG | TGG | CCC | CGG | G
                        |     Bsu36I      |     |   Xma I   |

|  e  |  v  |  l  |  d  |  l  |  v  |  r  |  k  |  v  |  a  |
| 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 |
|  AG | GTC | CTT | GAT | CTT | GTT | CGC | AAG | GTT | GCT |
|   PpuM I    |

|  e  |  e  |  n  |  g  |  r  |  s  |  v  |  n  |  s  |  e  |
| 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 |
| GAG | GAA | AAC | GGT | CGG | TCC | GTT | AAC | TCT |  G  |
                        |   Rsr II    |
                                    |   Hpa I   |

   |  i  |  y  |  n  |  r  |  v  |  m  |  e  |  s  |  f  |  k  |
   | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 |
AG | ATC | TAT | AAT | CGC | GTT | ATG | GAG | TCG | TTC | AAG |
|  Bgl II  |

|  k  |  e  |  g  |  r  |  i  |  g  |  a  |  .  |  .  |  .  |
| 147 | 148 | 149 | 150 | 151 | 152 | 153 |     |     |     |
| AAA | GAG | GGT | CGT | ATC | GGC | GCA | TAA | TAG | TGA |

| GGT | ACC |
|   Kpn I   |
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Table 509 2 Protein Ped-6-2-5 
Selected for Binding to Third Target 





|  m  |  R  |  D  |  V  |  W  |  H  |
|  1  | 96  | 97  | 98  | 99  | 100 |
| ATG | AGG | GAT | GTT | TGG | CAT |

|  V  |  R  |  N  |  I  |  T  |  R  |
| 101 | 102 | 103 | 104 | 105 | 106 |
| GTG | CGG | AAT | ATT | ACG | AGA |

|  V  |  R  |  H  |  L  |  n  |  l  |  r  |  w  |  p  |  r  |
| 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 |
| GTG | CGT | CAC | TTG | AAC | CTC | AGG | TGG | CCC | CGG | G
                        |     Bsu36I      |     |   Xma I   |

|  e  |  v  |  l  |  d  |  l  |  v  |  r  |  k  |  v  |  a  |
| 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 |
|  AG | GTC | CTT | GAT | CTT | GTT | CGC | AAG | GTT | GCT |
|   PpuM I    |

|  e  |  e  |  n  |  g  |  r  |  s  |  v  |  n  |  s  |  e  |
| 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 |
| GAG | GAA | AAC | GGT | CGG | TCC | GTT | AAC | TCT |  G  |
                        |   Rsr II    |
                                    |   Hpa I   |

   |  i  |  y  |  n  |  r  |  v  |  m  |  e  |  s  |  f  |  k  |
   | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 |
AG | ATC | TAT | AAT | CGC | GTT | ATG | GAG | TCG | TTC | AAG |
|  Bgl II  |

|  k  |  e  |  g  |  r  |  i  |  g  |  a  |  .  |  .  |  .  |
| 147 | 148 | 149 | 150 | 151 | 152 | 153 |     |     |     |
| AAA | GAG | GGT | CGT | ATC | GGC | GCA | TAA | TAG | TGA |

| GGT | ACC |
|   Kpn I   |



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Table 510: Variegation of Ped-6-2 
for binding to Third Target 

K|r R|h 
m t|i D|n I|v W|l p|l 
I i I 96| 97| 98| 99|100| 
j cot I cag I cGG | TAA | CCT | ATG | AnA | rAT | rTT | TmG | CnT | 
I spacer | ) BstE IT [ 

G| s 

V|d Q|r n|d M|i R|t T|r 
I 101 1 102 I 103 I 104 1 105 1 106 | 
I GwT I CrG j rrT | ATr | AsG | AsA | 

M|k F|c 

e|vP|rYjhl|w n| 1 | rj Center of symmetry 
I 107 1 108 1 109 1 110 I 111 1 112 1 113 I ^ for priming 

I rwG I CsT I yAC | Tkk | AAC | CTC I AGO | cgt | att | aat | acg | cct | g -3 • 

I BSU36I I olig#506 \ 



Self priming for extension with Klenow 



5 • -. . CTC I AGG I cgt | att | \ 
3 • - g I tec j gca | taa | / 
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Table 511! Variegation for 
Selection with Fourth Target 



I cct I cag I cGG | TAA | CCT 
I spacer | } BstE II | 



V I R I H I L 
I 107 I 108 I 109 I 110 
[ GTG I CGT I CAC | CTT 



m|x|x|x|x|x|x( 

1 I 90| 91] 92| 93| 94| 95 | 
ATG I f zk I f zk I f zk I f zk | f zk 1 f zk | 



R I D I V I W I H I 
96 I 97] 98 I 99 1 100 I 
CGG I GAC I GTG | TGG | CAC | 
overlap 1 



VlR|N|l|TjR| 
101 1 102 I 103 I 104 I 105 I 106 | 
GTG 1 CGG I AAT | ATT | ACG | CGA | 

n I 1 I r I a,, 

111 I 112 I 113 I 

AAC I CTC I AGG | cgt [ cac \ ggc | 
BSU36I I spacer L 



f = (0.26 T, 0.18 C, 0.26 A, 0.30 G)
z = (0.22 T, 0.16 C, 0.40 A, 0.22 G)
k = equimolar T and G

There are (2^5)^6 = 2^30 = 10^9 DNA sequences.
There are 20^6 = 6.4 x 10^7 protein sequences.
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Table 512: Protein Ped-6-2-5-2 
Selected for Binding to Fourth Target 











|  m  |  R  |  T  |  G  |  F  |  C  |  Q  |
|  1  | 90  | 91  | 92  | 93  | 94  | 95  |
| ATG | CGT | ACG | GGG | TTT | TGT | CAG |

|  R  |  D  |  V  |  W  |  H  |
| 96  | 97  | 98  | 99  | 100 |
| CGG | GAT | GTT | TGG | CAC |

|  V  |  R  |  N  |  I  |  T  |  R  |
| 101 | 102 | 103 | 104 | 105 | 106 |
| GTG | CGG | AAT | ATT | ACG | CGA |

|  V  |  R  |  H  |  L  |  n  |  l  |  r  |  w  |  p  |  r  |
| 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 |
| GTG | CGT | CAC | CTT | AAC | CTC | AGG | TGG | CCC | CGG | G
                        |     Bsu36I      |     |   Xma I   |

|  e  |  v  |  l  |  d  |  l  |  v  |  r  |  k  |  v  |  a  |
| 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 |
|  AG | GTC | CTT | GAT | CTT | GTT | CGC | AAG | GTT | GCT |
|   PpuM I    |

|  e  |  e  |  n  |  g  |  r  |  s  |  v  |  n  |  s  |  e  |
| 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 |
| GAG | GAA | AAC | GGT | CGG | TCC | GTT | AAC | TCT |  G  |
                        |   Rsr II    |
                                    |   Hpa I   |

   |  i  |  y  |  n  |  r  |  v  |  m  |  e  |  s  |  f  |  k  |
   | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 |
AG | ATC | TAT | AAT | CGC | GTT | ATG | GAG | TCG | TTC | AAG |
|  Bgl II  |
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Table 512, continued 

|  k  |  e  |  g  |  r  |  i  |  g  |  a  |  .  |  .  |  .  |
| 147 | 148 | 149 | 150 | 151 | 152 | 153 |     |     |     |
| AAA | GAG | GGT | CGT | ATC | GGC | GCA | TAA | TAG | TGA |

| GGT | ACC |
|   Kpn I   |
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Table 513 : Variegation 
of Length of Ped-6-2-5-2 

N|k V|g M|k 

i y|. y|. r|. 1|. 1|. 
1 137 1 138 I 139 I 140| 141 ( 142 | 
5 ' - cgacctagcAG | ATC | TAw^ | W3 AW4 | y^GA | kik2A | W3W4G | 
[ spacer | Bal II | 

F|c 

ej . s| . 1| . k| . 
I 143 I 144 I 145 I 146| 
I k3AG I TmjG I Tk2in2 | W2AG | - 



I|k 

k|. e|. g|. r|. 1]. g|.| . | . I 
I 147 I 148 1 149 I 150 1 151 | 152 | 153 1 | 
I W2AA I k3 AG I k3GA I y iGA | W3W4A | k3GA ] TAG | TGA | - 

|GGT|ACC|t- 3' 





w1 = 0.65 T and 0.35 A        y1 = 0.65 C and 0.35 T
k1 = 0.42 G and 0.58 T        k2 = 0.42 T and 0.58 G
k3 = 0.65 G and 0.35 T        m1 = 0.65 C and 0.35 A
m2 = 0.42 C and 0.58 A        w2 = 0.65 A and 0.35 T
w3 = 0.42 A and 0.58 T        w4 = 0.42 T and 0.58 A


Each variegated residue produces about 35% stop codons. 
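The figure of about 35% stop codons per variegated residue can be checked directly from the base mixtures. The sketch below assumes the mixture definitions as read from the legend of this table (w1, k1, k2, w3, w4 and so on) and represents each codon as a list of three tokens; the names `stop_probability` and `read_through` are hypothetical.

```python
from itertools import product
from math import prod

# Base mixtures as read from the legend of Table 513 (an assumption).
MIX = {
    "w1": {"T": 0.65, "A": 0.35}, "y1": {"C": 0.65, "T": 0.35},
    "k1": {"G": 0.42, "T": 0.58}, "k2": {"T": 0.42, "G": 0.58},
    "k3": {"G": 0.65, "T": 0.35}, "m1": {"C": 0.65, "A": 0.35},
    "m2": {"C": 0.42, "A": 0.58}, "w2": {"A": 0.65, "T": 0.35},
    "w3": {"A": 0.42, "T": 0.58}, "w4": {"T": 0.42, "A": 0.58},
}
STOPS = {"TAA", "TAG", "TGA"}


def choices(token):
    """Base -> probability for a fixed base or a named mixture."""
    return MIX.get(token, {token: 1.0})


def stop_probability(codon):
    """Probability that a codon given as three tokens is a stop codon."""
    total = 0.0
    for combo in product(*(choices(t).items() for t in codon)):
        bases = "".join(base for base, _ in combo)
        if bases in STOPS:
            total += prod(p for _, p in combo)
    return total


def read_through(codons):
    """Probability that none of the listed codons is a stop codon."""
    return prod(1.0 - stop_probability(c) for c in codons)


print(stop_probability(["T", "A", "w1"]))    # residue 138
print(stop_probability(["w3", "A", "w4"]))   # residue 139
print(stop_probability(["k1", "k2", "A"]))   # residue 141
```

Codons built from one 0.65/0.35 mixture give a stop fraction of 0.35, while codons built from two 0.42/0.58 mixtures give 0.58 x 0.58, or roughly 0.34, so both designs land near the stated 35%.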



Because (0.65)^^ = 0.003, only 0.3 % of variegated genes 
encode a protein shortened by one residue. 
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Table for Example 6 
Table 600: Third finger domain of Jgc -tgs- P22 arc 

* ^ 
|m|e|Jc|plir|h| 

l l I 2 I 3 1 4 I 5 i 6 1 

I AGG I AGG I TAA I CCT I ATG I GAG I AAA I CCG I TAT I CAC I - 
|BstE II I 

* *. *. 

I'C |s|h|c|d|r|q|F|v|q| 
I "7 I 8 1 9 I 10 1 11| 12 1 13 I 14 1 15 1 16 1 
I TGC I TCA| CAC| TGT | GAT | CGT | CAG | TTT | GTC| CAA| - 
I Dra TIT I 

* *. * * * * ft 

I V I a I n I 1 I r I r I H I 1 I r 1 v^l H .| 
I 17 1 18 1 19 1 20 1 21 1 22 1 23 1 2 4 1 25| 26 | 27 1 
I GTG J GCC I AAC | TTA | AGA | CGT | CAT | CTA | CGC | GTG | CAC ] - 

I Bal 1 1 |Afl Il[Aat II | |m1u I | 

I AoaT. T| 

I <- linker s> | <s P22 arc 

* ********* 

|t|g|t|g|s|m|k|g|m| 
I 28 I 29 I 30 I 31 1 32 I 101 1 102 1 103 I 104 I 

I ACT I GGT I ACC I GGG I TCT I ATG I AAA I GGC I ATG I 
I Kpn I I 

****** 

|s|k|m|p|qif|nll|r|w| 
I 105 1 106 I 107 I 108 1 109 ] 110 | 111 | 112 j 113 | 114 | 

|tct|aag|atg|ccg|caa[ttc|aac|ctt|agg|tgg| - 

|BSU36I| .. ...^ , , , 
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|p|r|e|v|l|d|l|vlr|k| 
I 115 I 116 1 117 I 118 I 119 I 120 I 121 1 122 | 123 | 124 | 
I CCC I CGG I GAG | GTC | CTT | GAT | TTG | GTT | CGC | AAA | 

I Ava 1 1 PpuM I I 

I Xma 1 1 



|v|a|e|e|n|g|r|s|v|n|s| 
I 125 I 12 6 1 127 I 128 | 129 | 13 0 ] 131 1 132 | 133 | 134 1 135 | 
I GTC I GCT I GAA | GAG | AAT | GGC | CGG | TCC | GTG | AAT | TCT | 
|Ksp 632 I I Rsr II I [EcoR l| 

|e|i|y|n|r|v|in|e|s| 
1 136 1 137 I 138 I 139 1 140 | 141 1 142 | 143 | 144 | 
I GAG I ATC I TAT | AAT | CGT | GTT | ATG | GAA | AGC | 
| Bql III 

|f|k|k|e|g|r|i|g|a|.| 
I 145 I 146 1 147 1 148 [ 149 1 150 | 151 1 152 | 153 | 154 | 
I TTC I AAG I AAG | GAA | GGT | CGC | ATT | GGT | GCA | TAA | 

I . I . I 

1 155 1 156 I 
I TAG I TGA I GGA | TTC | 
iHindlll I 

ji indicates residues of zinc finger domain thought to 
contact DNA in model of Gibson et al . 

* indicates residues of zinc finger domain, linker, and 
Arc that may influence DNA binding. 
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CLAIMS 



1. A selection vector for selecting recipient cells 
transformed by such vector that express a protein that 
binds preferentially to a predetermined target DNA 
sequence borne by said vector, which vector comprises: 

a) a first operon, which operon comprises: 

i) a first binding marker gene, 

ii) a first promoter directing expression of said 
binding marker gene, and 

iii) a first copy of the target DNA sequence, where 
said target DNA sequence interferes substantially 
with expression of the first gene if and only if 
a protein expressed by the recipient cell binds to 
the target DNA sequence, 

b) a second operon, which operon comprises: 

i) a second binding marker gene, 

ii) a second promoter directing expression of said 
second binding marker gene, and 

iii) a second copy of the target DNA sequence, 
where said target DNA sequence interferes substan- 
tially with expression of the second gene if and 
only if a protein expressed by the recipient cell 
binds to the target DNA sequence, 

where the binding marker genes of said first and second 
operons are different, and where, when said transformed 
cells are exposed to forward selection conditions the 
gene products of said first and second binding marker 
genes are deleterious or lethal to the recipient cell. 



2. The vector of claim 1 in which at least one of the
operons confers a genotype selected from the group
consisting of galT,K+, tetA+, lacZ+, pheS+, argP+, thyA+,
crp+, pyrF+, ptsM+, secA+/malE+/lacZ+, ompA+, btuB+, lamB+,
tonA+, cir+, tsx+, aroP+, cysK+, and dctA+.
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3. The vector of claim 1 wherein the binding marker genes
are functionally unrelated.



4. The vector of claim 3 wherein the first and second
operons confer, respectively, a pair of genotypes
selected from the group consisting of:

(a) galT,K+ and tetA+;
(b) argP+ and pheS+;
(c) lacZ+ and tetA+;
(d) dctA+ and cysK+;
(e) crp+ and thyA+;
(f) lamB+ and thyA+;
(g) secA+/malE+/lacZ+ and pyrF+;
(h) tsx+ and cysK+;
(i) dctA+ and thyA+;
(j) galT,K+ and pheS+;
(k) tetA+ and thyA+;
(l) ptsM+ and thyA+;
(m) ompA+ and pyrF+;
(n) btuB+ and pyrF+;
(o) tonA+ and galT,K+;
(p) cir+ and cysK+; and
(q) aroP+ and lacZ+.



5. The vector of claim 1 wherein the promoters of said 
first and second operons are different. 

6. The vector of claim 2 wherein the degree of homology
between the first and second promoters is less than 50%
in the region between the -10 region of the promoter and
the base at which transcription is initiated.



7. The vector of claim 1, wherein at least one of said
operons comprises a plurality of copies of the target
DNA sequence, wherein each copy is positioned so that
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the target DNA sequence interferes substantially with 
expression if and only if a protein expressed by the 
recipient cell binds to the target DNA sequence. 

8. The vector of claim 1, further comprising a plurality
of genetic elements essential to the maintenance of the 
vector or the survival of the transformed cells under 
conditions that select for presence of said vector, said 
operons and said genetic elements being positioned on 
said vector so no single deletion event can render non- 
functional more than one of said operons without also 
rendering nonfunctional one of said essential genetic 
elements. 



9. The vector of claim 8, wherein at least one of said
genetic elements comprises a selectably beneficial or
essential gene, and a control promoter operably linked
to said beneficial or conditionally essential gene, but
where no instance of said target DNA sequence is associated
with said genetic element.



10. The vector of claim 9 wherein the control promoter is
essentially identical to the promoter of one of said 
selectable binding marker operons, so that proteins 
binding to the latter promoter will also bind to the 
control promoter and thereby inhibit expression of said 
beneficial or essential gene. 

11. The vector of claim 1, wherein under reverse selection
conditions the gene products of said binding marker
genes are beneficial or conditionally essential to the
transformed cells.

12. The vector of claim 11, wherein each of the first and
second operons confers a phenotype selected
independently but non-identically from the group
consisting of: galT,K+, tetA+, lacZ+, pheS+, argP+,
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thyA+, crp+, pyrF+, ptsM+, secA+/malE+/lacZ+, ompA+, btuB+,
lamB+, tonA+, cir+, tsx+, aroP+, cysK+, and dctA+.

13. The vector of claim 1, further comprising a
nondeleterious cloning site so positioned that insertion

of a foreign gene at such site does not inactivate said 
first or second operons or any genetic element of said 
vector required for its maintenance within the 
transformed cell. 

14. The selection vector of claim 1 in which the target
sequence associated with at least one of said operons 
is positioned within the RNA-polymerase binding site of 
the promoter of the operon. 

15. The selection vector of claim 1 in which the target 
sequence associated with at least one of said operons 
is positioned upstream of the -35 region of the promoter 
of the operon. 

16. The selection vector of claim 1 in which the target
sequence associated with at least one of said operons
is positioned downstream of the -10 region of the
promoter of the operon.

17. The selection vector of claim 1 in which the target 
sequence is positioned so that the most 5' base of the 
target sequence is transcribed into the +1 base or the 
+2 base of the mRNA transcribed under the direction of 
the promoter of the operon. 

18. The selection vector of claim 6 in which one of said
genetic elements is the origin of replication of said
vector.
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19. The vector of claim 1, further comprising a gene (pdbp)
coding on expression for a potential DNA-binding protein 
or polypeptide, said gene comprising: 

a) a coding region that codes on expression for a 
polypeptide, each domain of said polypeptide having 
at least 50% sequence identity to a known DNA- 
binding domain, and 

b) a promoter operably linked to said coding region 
for controlling its expression. 

20. The vector of claim 11, wherein said promoter of said 
gene is an inducible or repressible promoter. 



21. A variegated population of vectors according to claim 
19, the variegation occurring within the pdbp gene so 
that said vectors collectively can express a plurality 
of different but sequence-related potential DNA-binding
proteins.

22. A cell culture comprising a plurality of cells, each 
cell bearing a selection vector according to claim 1, 
said cell bearing a gene coding on expression for a
potential DNA-binding protein, where said cells
collectively can express a plurality of different but
sequence-related potential DNA-binding proteins. 



23. A method of obtaining a gene coding on expression for 
a novel DNA-binding protein or polypeptide that 
preferentially binds a predetermined DNA target sequence 
in double stranded DNA, comprising: 

(a) providing a cell culture according to claim 22; 

(b) causing the cells of such culture to express said
potential DNA-binding proteins or polypeptides;

(c) exposing the cells to forward selection conditions
to select for cells which express a protein or
polypeptide which preferentially binds to a said
target DNA sequence; and

(d) recovering the selected cells bearing a gene coding
on expression for such protein or polypeptide.

24. A method of producing a DNA-binding protein or
polypeptide which preferentially binds a predetermined
double stranded DNA target, which comprises providing
a gene obtained by the method of claim 15 which codes
on expression for such protein or polypeptide,
expressing the gene in a suitable host cell, and
recovering said protein or polypeptide.

25. A non-naturally occurring DNA binding protein or
polypeptide produced by the method of claim 24.

26. A method of obtaining a protein or polypeptide which
may be used to specifically repress a coding or
regulatory element of interest which comprises
identifying an ultimate target sequence within such
element and obtaining a protein or polypeptide which
preferentially binds to such ultimate target sequence
by the method of claim 23.

27. A method of producing a protein which binds to a 
predetermined ultimate target double stranded DNA 
sequence, said sequence being nonpalindromic, said 
sequence comprising a left target-subsequence and right 
target subsequence each of at least 4 base pairs length, 
said method comprising: 

(a) providing a first gene encoding a first DNA-binding
oligomeric protein binding to a first target
sequence and a second gene encoding a second DNA-
binding oligomeric protein binding to a second
target sequence, wherein said first and second DNA-
binding proteins each have at least two essentially
dyad-symmetric DNA-binding domains, where said
first target sequence comprises said left target
subsequence and a palindrome of the left target
subsequence, whereby one of the DNA-binding domains
of the first DNA-binding protein binds to said left
target subsequence, and where said second target
sequence comprises said right target subsequence
and a palindrome of the right target subsequence,
whereby one of the DNA-binding domains of the
second DNA-binding protein binds to the right
target subsequence,



(b) providing a host cell carrying said first and 
second genes, each operably linked to a promoter 
functional in the host cell, 

(c) co-expressing said first and second genes so as to
obtain a heterooligomeric DNA-binding protein
comprising a DNA-binding domain recognizing the left
target subsequence and a DNA-binding domain
recognizing the right target subsequence; and



(d) separating said heteromultimeric protein from other 
co-expression products of said first and second 
genes on the basis of its affinity for the ultimate 
target DNA.
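
By way of illustration only, the relationship between the left and right
target subsequences, their symmetrized targets, and the nonpalindromic
ultimate target recited in claim 27 can be sketched as follows. The sketch
assumes that "a palindrome of" a subsequence means its reverse complement,
that the halves abut with no central spacer base pair, and that the
ultimate target is the left subsequence followed directly by the right
subsequence. The function names and the example sequences are hypothetical
and do not appear in the specification.

```python
# Illustrative sketch only. Assumptions (not taken from the claims):
#  - "palindrome of X" is modeled as the reverse complement of X;
#  - the two halves abut with no central spacer base pair;
#  - the ultimate target is left + right with nothing in between.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
MIN_HALF_LENGTH = 4  # each target subsequence is "at least 4 base pairs"


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (top strand, 5'->3')."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_palindromic(seq: str) -> bool:
    """True if the duplex reads the same on both strands (dyad symmetric)."""
    return seq.upper() == reverse_complement(seq)


def _check_half(seq: str) -> str:
    seq = seq.upper()
    if len(seq) < MIN_HALF_LENGTH or set(seq) - set("ACGT"):
        raise ValueError("subsequence must be >= 4 bases of A, C, G, T")
    return seq


def left_symmetrized(left: str) -> str:
    """First target sequence: left subsequence plus its palindrome."""
    left = _check_half(left)
    return left + reverse_complement(left)


def right_symmetrized(right: str) -> str:
    """Second target sequence: palindrome of the right subsequence plus itself."""
    right = _check_half(right)
    return reverse_complement(right) + right


def ultimate_target(left: str, right: str) -> str:
    """Nonpalindromic ultimate target: left subsequence then right subsequence."""
    return _check_half(left) + _check_half(right)


if __name__ == "__main__":
    left, right = "TACA", "GGTC"  # hypothetical example subsequences
    print(left_symmetrized(left), is_palindromic(left_symmetrized(left)))
    print(right_symmetrized(right), is_palindromic(right_symmetrized(right)))
    print(ultimate_target(left, right),
          is_palindromic(ultimate_target(left, right)))
```

Under these assumptions each symmetrized target is palindromic, so a
homooligomer with dyad-symmetric domains can bind it, while the ultimate
target is not, which is why a heterooligomer carrying one domain of each
kind is required.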



28. The method of claim 27, further comprising crosslinking
the heterooligomeric protein in vitro.

29. A method of obtaining genes encoding a heterooligomeric
protein which binds to a predetermined ultimate target
double stranded DNA sequence, said sequence being
nonpalindromic, said sequence comprising a left target-
subsequence and right target subsequence each of at
least 4 base pairs length, said method comprising:

(a) providing a first gene encoding a first DNA-binding
oligomeric protein binding to a first target
sequence and a second gene encoding a second DNA-
binding oligomeric protein binding to a second
target sequence, wherein said first and second DNA-
binding proteins each have at least two dyad-
symmetric DNA-binding domains, where said first
target sequence comprises said left target
subsequence and a palindrome of the left target
subsequence, whereby one of the DNA-binding domains
of the first DNA-binding protein binds to said left
target subsequence, and where said second target
sequence comprises said right target subsequence
and a palindrome of the right target subsequence,
whereby one of the DNA-binding domains of the
second DNA-binding protein binds to the right
target subsequence,

(b) variegating one of said first or second genes and
reverse selecting for expression of a first
oligomerization mutant protein encoded by a
variegant of said variegated gene which is no
longer capable of forming a homooligomer that can
bind to one of said first or second target
sequences, respectively, and verifying that said
oligomerization mutant protein maintains a tertiary
structure similar to the protein from which it
descended,



(c) variegating the other of said first or second
genes,

(d) providing a host cell carrying the gene encoding
said first oligomerization mutant protein and said
variegated gene of step (c), each operably linked
to a promoter functional in the host cell, and

(e) forward selecting for expression of a second
oligomerization mutant protein which is capable of
forming a heterooligomer with said first
oligomerization mutant protein, said heterooligomer binding
said ultimate target DNA sequence, and



(f) isolating the genes encoding said heterooligomer.



30. A method of producing a heterooligomeric protein which
binds to a predetermined nonpalindromic double stranded 
ultimate target DNA sequence which comprises providing 
a host cell bearing the genes obtained by the method of 
claim 29 and coding for such protein, each gene being 
operably linked to a promoter functional in the host 
cell, and expressing said gene. 

31. The method of claim 30, further comprising crosslinking
the heterooligomeric protein in vitro.

32. A method of obtaining genes encoding a heterooligomeric
protein which binds to a predetermined ultimate target
DNA sequence, said sequence comprising a left target
subsequence and a right target subsequence, each of at
least 4 base pairs length, said ultimate target DNA
sequence being non-palindromic, said method comprising:

(a) variegating a first gene encoding a first DNA
binding domain of a known oligomeric DNA-binding
protein having dyad-symmetric DNA-binding domains,
said protein recognizing a known, essentially
palindromic DNA sequence, and forward selecting for
expression of a first DNA-recognition mutant
protein that is capable of forming a homooligomer
that binds to a left-symmetrized target DNA
sequence comprising said left target subsequence
and a palindrome thereof,

(b) variegating a second copy of said first gene
encoding said first DNA binding domain of a known
oligomeric DNA-binding protein having dyad-symmetric
DNA-binding domains, said protein recognizing
a known, essentially palindromic DNA sequence,
and forward selecting for expression of a second
DNA-recognition mutant protein that is capable of
forming a homooligomer that binds to a right-
symmetrized target DNA sequence comprising said
right target subsequence and a palindrome thereof,

(c) variegating either the gene encoding said first
DNA-recognition mutant protein or the gene encoding
said second DNA-recognition mutant protein at
codons encoding amino acids forming the dimerization
interface of said mutant protein, reverse
selecting for an initial dimerization mutant
protein that cannot bind to a target DNA sequence
comprising the corresponding symmetrized target of
said first or second DNA-recognition mutant because
it cannot dimerize, and verifying that said initial
dimerization mutant protein maintains a tertiary
structure similar to the protein from which it is
descended;

(d) variegating the gene encoding the other DNA-
recognition mutant protein and forward selecting,
in cells that express the initial dimerization
mutant, for expression of a heterooligomeric
protein that is capable of binding to said non-
palindromic DNA target which comprises said left
target and said right target.

33. A method of producing a heterooligomeric protein which
binds to a predetermined nonpalindromic double stranded 
ultimate target DNA sequence which comprises providing 
a host cell bearing the genes obtained by the method of 
claim 32 and coding for such protein, each gene being 
operably linked to a promoter functional in the host 
cell, and expressing said genes.

34. The method of claim 32, further comprising crosslinking
said heterooligomeric protein in vitro.



35. The method of claim 23, wherein a gene coding on 
expression for a polypeptide is variegated, said 
polypeptide having a length of 25-50 amino acids and
being capable of lying in the major groove of B-DNA. 

36. The method of claim 23 wherein the gene codes on 
expression for a molecule comprising a DNA-binding 
domain of 25-50 amino acids and a custodial domain that
substantially inhibits the degradation of the molecule 
by intracellular enzymes. 

37. The method of claim 23 wherein said protein is a
globular protein.

38. The method of claim 37 wherein a gene coding on 
expression for a known DNA binding protein having a 
helix-turn-helix DNA binding motif is variegated. 



39. The method of claim 23 wherein a gene encoding a known
DNA binding protein picked from the group consisting of
Cro from phage λ, cI repressor from phage λ, Cro from
phage 434, cI repressor from phage 434, P22 repressor,
E. coli tryptophan repressor, E. coli CAP, P22 Arc, P22
Mnt, E. coli lactose repressor, MAT-a1-alpha2 from
yeast, Polyoma Large T antigen, SV40 Large T antigen,
Adenovirus E1A, and TFIIIA from Xenopus laevis is
variegated to obtain genes coding on expression for a
plurality of potential target DNA-binding proteins.

40. The method of claim 23 in which the DNA binding protein
comprises a plurality of zinc finger DNA-binding
domains.

41. The method of claim 23 in which the initial potential
DNA-binding protein comprises one or more zinc-finger
DNA-binding domains fused to a DNA-binding protein
picked from the group consisting of Cro from phage λ,
cI repressor from phage λ, Cro from phage 434, cI
repressor from phage 434, P22 repressor, E. coli
tryptophan repressor, E. coli CAP (also known as CRP),
P22 Arc, P22 Mnt, E. coli lactose repressor, MAT-a1-
alpha2 from yeast, Polyoma Large T antigen, SV40 Large
T antigen, Adenovirus E1A, and TFIIIA from Xenopus
laevis.

42. The cell culture of claim 22 wherein the cells are 
bacterial cells.

43. The cell culture of claim 42 in which the cells are 
Escherichia coli cells.



44. The method of claim 23 in which the cells prior to
transformation are of a GalE"- GalT'- GalK'- Tet^
phenotype, the binding marker genes are the tet and
galT genes, and the forward selection condition is
cultivation of the cells in a medium containing
galactose, or fusaric acid, or both, or substances
metabolized into or catalyzing the production of
galactose or fusaric acid.

45. The method of claim 44 in which the vector further
contains an origin of replication and an antibiotic
resistance gene, and cells are cultured in medium that
further comprises an antibiotic for which
resistance is conferred by said antibiotic resistance
gene.



46. The population of claim 13, wherein the level of
variegation is such that from lo'^ to lo' different 
potential DNA-binding proteins can be expressed.